[llvm] MachineScheduler: Improve instruction clustering (PR #137784)

via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 29 03:35:28 PDT 2025


https://github.com/ruiling created https://github.com/llvm/llvm-project/pull/137784

Clustered nodes are currently managed by adding weak edges between neighbouring cluster nodes, which forms a sort of ordered queue. This is later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`.

However, instructions may not be picked in the exact order of the queue. For example, suppose we have a queue of cluster nodes A B C. During scheduling, node B might be picked first, in which case it is very likely that only B and C get clustered under top-down scheduling (leaving A alone).

Another issue is:
```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.

For example, we want to cluster nodes (in `MemOpRecords` order): 1 3 2. Normally 1(SUa) becomes a pred of 3(SUb). But when it comes to the pair (3, 2), since 3(SUa) > 2(SUb), we swap the two nodes, which makes 2 a pred of 3. Both 1 and 2 are now preds of 3, but there is no edge between 1 and 2. Thus we get a broken cluster chain.
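The 1 3 2 case can be reproduced outside LLVM with a small standalone sketch. It is not the scheduler code: `ClusterEdge`, `buildClusterEdges` and `hasEdgeBetween` are hypothetical names used only for illustration, and the sketch assumes every neighbouring pair is clustered (the real mutation also checks cluster size limits and whether `addEdge` succeeds).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a cluster edge: Pred is scheduled before Succ.
struct ClusterEdge {
  unsigned Pred;
  unsigned Succ;
};

// Walk neighbouring pairs in MemOpRecords order. When reordering is not
// allowed, swap the pair so the lower node number becomes the pred, mirroring
// the snippet quoted above.
std::vector<ClusterEdge> buildClusterEdges(const std::vector<unsigned> &Order,
                                           bool ReorderWhileClustering) {
  assert(!Order.empty() && "need at least one node");
  std::vector<ClusterEdge> Edges;
  for (std::size_t I = 0; I + 1 < Order.size(); ++I) {
    unsigned A = Order[I];
    unsigned B = Order[I + 1];
    if (!ReorderWhileClustering && A > B)
      std::swap(A, B);
    Edges.push_back({A, B});
  }
  return Edges;
}

// True if some edge connects X and Y in either direction.
bool hasEdgeBetween(const std::vector<ClusterEdge> &Edges, unsigned X,
                    unsigned Y) {
  for (const ClusterEdge &E : Edges)
    if ((E.Pred == X && E.Succ == Y) || (E.Pred == Y && E.Succ == X))
      return true;
  return false;
}
```

Calling `buildClusterEdges({1, 3, 2}, false)` yields the edges 1 -> 3 and 2 -> 3, and `hasEdgeBetween(Edges, 1, 2)` is false, which is the broken chain described above.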

To fix both issues, we introduce an unordered set in the change. This could help improve clustering in some hard cases.
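As a rough illustration of the unordered-set idea, the sketch below groups nodes with a tiny union-find and then collects the members of one group. The patch itself uses `llvm::EquivalenceClasses` and `SmallSet`; `NodeGroups` and `clusterOf` here are hypothetical stand-ins built on the standard library so the example compiles on its own.

```cpp
#include <cassert>
#include <numeric>
#include <set>
#include <utility>
#include <vector>

// Minimal union-find over node numbers 0..N-1 (illustrative only).
class NodeGroups {
  std::vector<unsigned> Parent;

public:
  explicit NodeGroups(unsigned N) : Parent(N) {
    std::iota(Parent.begin(), Parent.end(), 0u);
  }

  // Return the representative of X's group, shortening paths on the way.
  unsigned find(unsigned X) {
    assert(X < Parent.size() && "node out of range");
    while (Parent[X] != X) {
      Parent[X] = Parent[Parent[X]];
      X = Parent[X];
    }
    return X;
  }

  // Merge the groups containing A and B.
  void unite(unsigned A, unsigned B) {
    unsigned RootA = find(A);
    unsigned RootB = find(B);
    Parent[RootA] = RootB;
  }
};

// Merge every clustered pair, then return all nodes that ended up in the same
// group as Node. The result has no scheduling order attached to it.
std::set<unsigned>
clusterOf(unsigned NumNodes,
          const std::vector<std::pair<unsigned, unsigned>> &Pairs,
          unsigned Node) {
  NodeGroups Groups(NumNodes);
  for (const auto &P : Pairs)
    Groups.unite(P.first, P.second);

  std::set<unsigned> Members;
  unsigned Leader = Groups.find(Node);
  for (unsigned I = 0; I < NumNodes; ++I)
    if (Groups.find(I) == Leader)
      Members.insert(I);
  return Members;
}
```

With the pairs (1, 3) and (2, 3), the group containing node 2 is {1, 2, 3}. Node 1 and node 2 are therefore known to belong together even though no edge links them directly, and whichever member is picked first can still attract the other two.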

There are two major reasons why there are so many test check changes.
1. The existing implementation has some buggy behavior: the scheduler does not reset the pointer to the next cluster candidate. For example, we want to cluster A and B, but after picking A, we might pick node C. In theory, we should reset the next cluster candidate at that point, because we have decided not to cluster A and B during scheduling. Picking B later because of Cluster does not seem logical.

2. As the cluster candidates are not ordered now, the candidates might be picked in different order from before.
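The reset behavior in point 1 can be sketched as a small tracker whose active cluster always follows the most recently scheduled node. This is a simplified model, not the scheduler itself: `ClusterTracker`, `NoCluster`, `onScheduled` and `prefers` are hypothetical names, and nodes are plain indices rather than `SUnit` pointers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sentinel meaning "this node belongs to no cluster".
constexpr unsigned NoCluster = ~0u;

struct ClusterTracker {
  // ClusterOfNode[N] is the cluster index of node N, or NoCluster.
  std::vector<unsigned> ClusterOfNode;
  // Cluster of the most recently scheduled node.
  unsigned Current = NoCluster;

  // Called after a node is scheduled. The active cluster is overwritten on
  // every pick, so scheduling an unclustered node clears it.
  void onScheduled(unsigned Node) {
    assert(Node < ClusterOfNode.size() && "node out of range");
    Current = ClusterOfNode[Node];
  }

  // True if Node should currently receive the clustering preference.
  bool prefers(unsigned Node) const {
    assert(Node < ClusterOfNode.size() && "node out of range");
    return Current != NoCluster && ClusterOfNode[Node] == Current;
  }
};
```

Taking A = 0 and B = 1 in cluster 0 and C = 2 unclustered: after `onScheduled(0)`, `prefers(1)` is true; after a subsequent `onScheduled(2)`, `prefers(1)` becomes false, so B is no longer favoured merely because A was picked earlier.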

The most affected targets are: AMDGPU, AArch64, RISCV.

For RISCV, most changes appear to be minor instruction reordering; I don't see any obvious regression.

For AArch64, some combining of ldr into ldp is affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster them well either before or after the change, and the later load-combining algorithm is also not smart enough.

For AMDGPU, some cases use more v_dual instructions while others regress. This seems less critical. The test `v_vselect_v32bf16` appears to get more buffer_load instructions claused.

>From 900eb8b4e6110d692b0a458c4c5a1e14da3c343e Mon Sep 17 00:00:00 2001
From: Ruiling Song <ruiling.song at amd.com>
Date: Fri, 21 Mar 2025 09:21:46 +0800
Subject: [PATCH] MachineScheduler: Improve instruction clustering

The existing way of managing clustered nodes was done through adding
weak edges between the neighbouring cluster nodes, which is a sort of
ordered queue. And this will be later recorded as `NextClusterPred` or
`NextClusterSucc` in `ScheduleDAGMI`.

But actually the instruction may be picked not in the exact order of the queue.
For example, we have a queue of cluster nodes A B C. But during scheduling,
node B might be picked first, then it will be very likely that we only
cluster B and C for Top-Down scheduling (leaving A alone).

Another issue is:
```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.

For example, we want to cluster nodes (order as in `MemOpRecords`): 1 3 2.
1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2),
as 3(SUa) > 2(SUb), we would reorder the two nodes, which makes
2 be pred of 3. This makes both 1 and 2 become preds of 3, but
there is no edge between 1 and 2. Thus we get a broken cluster chain.

To fix both issues, we introduce an unordered set in the change.
This could help improve clustering in some hard cases.

There are two major reasons why there are so many test check changes.
1. The existing implementation has some buggy behavior:
   The scheduler does not reset the pointer to next cluster candidate.
   For example, we want to cluster A and B, but after picking A, we
   might pick node C. In theory, we should reset the next cluster
   candidate here, because we have decided not to cluster A and B during
   scheduling. Later picking B because of Cluster seems not logical.

2. As the cluster candidates are not ordered now, the candidates might
   be picked in different order from before.

The most affected targets are: AMDGPU, AArch64, RISCV.

For RISCV, most changes appear to be minor instruction reordering; I don't
see any obvious regression.

For AArch64, some combining of ldr into ldp is affected, with two cases
regressed and two improved. The deeper reason is that the machine scheduler
cannot cluster them well either before or after the change, and the later
load-combining algorithm is also not smart enough.

For AMDGPU, some cases have more v_dual instructions used while some are
regressed. It seems less critical. Seems like test `v_vselect_v32bf16` gets
more buffer_load being claused.
---
 llvm/include/llvm/CodeGen/MachineScheduler.h  |    14 +-
 llvm/include/llvm/CodeGen/ScheduleDAG.h       |     7 +
 llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h |    10 +
 llvm/lib/CodeGen/MachineScheduler.cpp         |    75 +-
 llvm/lib/CodeGen/MacroFusion.cpp              |    13 +
 llvm/lib/CodeGen/ScheduleDAG.cpp              |     3 +
 llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp   |    22 +-
 .../Target/PowerPC/PPCMachineScheduler.cpp    |    18 +-
 .../argument-blocks-array-of-struct.ll        |     3 +-
 .../AArch64/arm64-dagcombiner-load-slicing.ll |    12 +-
 llvm/test/CodeGen/AArch64/bcmp.ll             |     7 +-
 llvm/test/CodeGen/AArch64/expand-select.ll    |    20 +-
 llvm/test/CodeGen/AArch64/extbinopload.ll     |   115 +-
 llvm/test/CodeGen/AArch64/fcmp.ll             |    34 +-
 .../CodeGen/AArch64/fp-conversion-to-tbl.ll   |    30 +-
 llvm/test/CodeGen/AArch64/fptoi.ll            |   140 +-
 .../test/CodeGen/AArch64/fptoui-sat-vector.ll |    32 +-
 llvm/test/CodeGen/AArch64/itofp.ll            |   180 +-
 llvm/test/CodeGen/AArch64/mul.ll              |    24 +-
 llvm/test/CodeGen/AArch64/nontemporal-load.ll |    17 +-
 llvm/test/CodeGen/AArch64/nzcv-save.ll        |    18 +-
 .../AArch64/sve-fixed-vector-llrint.ll        |    86 +-
 .../CodeGen/AArch64/sve-fixed-vector-lrint.ll |    86 +-
 ...e-streaming-mode-fixed-length-bitselect.ll |    94 +-
 ...-streaming-mode-fixed-length-fp-convert.ll |    16 +-
 ...e-streaming-mode-fixed-length-fp-reduce.ll |    24 +-
 ...-streaming-mode-fixed-length-fp-vselect.ll |    24 +-
 ...streaming-mode-fixed-length-int-extends.ll |   162 +-
 ...ve-streaming-mode-fixed-length-int-mulh.ll |    24 +-
 ...-streaming-mode-fixed-length-int-reduce.ll |    32 +-
 ...e-streaming-mode-fixed-length-int-to-fp.ll |   146 +-
 ...streaming-mode-fixed-length-int-vselect.ll |    48 +-
 ...g-mode-fixed-length-permute-zip-uzp-trn.ll |    84 +-
 .../sve-streaming-mode-fixed-length-ptest.ll  |    12 +-
 llvm/test/CodeGen/AArch64/vec_uaddo.ll        |     2 +-
 llvm/test/CodeGen/AArch64/vec_umulo.ll        |     8 +-
 llvm/test/CodeGen/AArch64/vselect-ext.ll      |    30 +-
 .../AArch64/wide-scalar-shift-legalization.ll |    59 +-
 llvm/test/CodeGen/AArch64/zext-to-tbl.ll      |   109 +-
 .../CodeGen/AMDGPU/GlobalISel/add.vni16.ll    |    54 +-
 llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll   |    23 +-
 .../AMDGPU/GlobalISel/extractelement.ll       |    60 +-
 llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll   |   635 +-
 llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll   |   581 +-
 .../AMDGPU/GlobalISel/insertelement.i16.ll    |    23 +-
 .../AMDGPU/GlobalISel/insertelement.i8.ll     |    19 +-
 .../AMDGPU/GlobalISel/insertelement.ll        |    10 +-
 .../GlobalISel/llvm.amdgcn.intersect_ray.ll   |     8 +-
 llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll   |     4 +-
 llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll    |   249 +-
 .../test/CodeGen/AMDGPU/GlobalISel/saddsat.ll |    79 +-
 .../test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll |    38 +-
 .../test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll |    98 +-
 .../CodeGen/AMDGPU/GlobalISel/udiv.i64.ll     |    48 +-
 .../test/CodeGen/AMDGPU/GlobalISel/udivrem.ll |   138 +-
 .../CodeGen/AMDGPU/GlobalISel/urem.i64.ll     |    48 +-
 .../CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll  | 37061 ++++++++--------
 .../CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll   |    26 +-
 .../CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll   |     8 +-
 .../CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll   |   268 +-
 .../CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll   |  7461 ++--
 .../CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll   |   234 +-
 .../CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll   |   356 +-
 .../CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll   |   837 +-
 .../test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll |     4 +-
 .../AMDGPU/amdgpu-cs-chain-preserve-cc.ll     |     4 +-
 .../atomic_optimizations_local_pointer.ll     |    12 +-
 llvm/test/CodeGen/AMDGPU/bf16.ll              |  3365 +-
 .../buffer-fat-pointer-atomicrmw-fadd.ll      |    21 +-
 .../buffer-fat-pointer-atomicrmw-fmax.ll      |    13 +-
 .../buffer-fat-pointer-atomicrmw-fmin.ll      |    13 +-
 .../AMDGPU/buffer-fat-pointers-memcpy.ll      |     8 +-
 .../CodeGen/AMDGPU/call-argument-types.ll     |    30 +-
 .../test/CodeGen/AMDGPU/carryout-selection.ll |     3 +-
 llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll   |    86 +-
 llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll   |    88 +-
 llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll     |    42 +-
 llvm/test/CodeGen/AMDGPU/ds-alignment.ll      |    84 +-
 llvm/test/CodeGen/AMDGPU/ds_read2.ll          |   133 +-
 .../fast-unaligned-load-store.global.ll       |     6 +-
 .../fast-unaligned-load-store.private.ll      |     6 +-
 llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll     |    24 +-
 llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll     |     7 +-
 llvm/test/CodeGen/AMDGPU/fdiv.ll              |     2 +-
 llvm/test/CodeGen/AMDGPU/fmed3.ll             |     4 +-
 .../CodeGen/AMDGPU/fneg-modifier-casting.ll   |     2 +-
 llvm/test/CodeGen/AMDGPU/fp-classify.ll       |     2 +-
 llvm/test/CodeGen/AMDGPU/freeze.ll            |   233 +-
 .../CodeGen/AMDGPU/function-args-inreg.ll     |     6 +-
 llvm/test/CodeGen/AMDGPU/function-args.ll     |   498 +-
 .../AMDGPU/gfx-callable-argument-types.ll     |    45 +-
 .../AMDGPU/gfx-callable-return-types.ll       |    61 +-
 llvm/test/CodeGen/AMDGPU/global_atomics.ll    |     8 +-
 llvm/test/CodeGen/AMDGPU/half.ll              |     2 +-
 llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll        |    32 +-
 llvm/test/CodeGen/AMDGPU/idiv-licm.ll         |     7 +-
 llvm/test/CodeGen/AMDGPU/idot4s.ll            |    18 +-
 llvm/test/CodeGen/AMDGPU/idot4u.ll            |    24 +-
 .../CodeGen/AMDGPU/indirect-addressing-si.ll  |    64 +-
 llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll |    36 +-
 .../CodeGen/AMDGPU/integer-mad-patterns.ll    |    12 +-
 llvm/test/CodeGen/AMDGPU/kernel-args.ll       |    18 +-
 ...vm.amdgcn.global.atomic.ordered.add.b64.ll |     2 +-
 .../AMDGPU/llvm.amdgcn.intersect_ray.ll       |    20 +-
 .../AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll  |     4 +-
 .../AMDGPU/llvm.amdgcn.permlane64.ptr.ll      |    21 +-
 .../AMDGPU/llvm.amdgcn.readfirstlane.ll       |    16 +-
 .../CodeGen/AMDGPU/llvm.amdgcn.writelane.ll   |     4 +-
 llvm/test/CodeGen/AMDGPU/llvm.log.ll          |     7 +-
 llvm/test/CodeGen/AMDGPU/llvm.log10.ll        |     7 +-
 llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll  |   212 +-
 llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll  |   230 +-
 llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll  |   580 +-
 llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll  |    22 +-
 llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll  |   230 +-
 llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll  |   580 +-
 llvm/test/CodeGen/AMDGPU/llvm.round.ll        |     6 +-
 llvm/test/CodeGen/AMDGPU/load-constant-i1.ll  |    41 +-
 llvm/test/CodeGen/AMDGPU/load-constant-i16.ll |   170 +-
 llvm/test/CodeGen/AMDGPU/load-constant-i32.ll |    12 +-
 llvm/test/CodeGen/AMDGPU/load-constant-i8.ll  |    36 +-
 llvm/test/CodeGen/AMDGPU/load-global-i16.ll   |     2 +-
 llvm/test/CodeGen/AMDGPU/load-global-i32.ll   |   341 +-
 .../AMDGPU/load-local-redundant-copies.ll     |    42 +-
 llvm/test/CodeGen/AMDGPU/load-local.128.ll    |    68 +-
 llvm/test/CodeGen/AMDGPU/load-local.96.ll     |    50 +-
 ...er-buffer-fat-pointers-lastuse-metadata.ll |    16 +-
 ...uffer-fat-pointers-nontemporal-metadata.ll |    32 +-
 llvm/test/CodeGen/AMDGPU/max.i16.ll           |     6 +-
 llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll    |   118 +-
 .../AMDGPU/memcpy-param-combinations.ll       |    78 +-
 .../CodeGen/AMDGPU/memintrinsic-unroll.ll     |   954 +-
 .../AMDGPU/memmove-param-combinations.ll      |   205 +-
 llvm/test/CodeGen/AMDGPU/min.ll               |     4 +-
 llvm/test/CodeGen/AMDGPU/mul.ll               |    36 +-
 .../CodeGen/AMDGPU/narrow_math_for_and.ll     |     4 +-
 llvm/test/CodeGen/AMDGPU/or.ll                |     4 +-
 llvm/test/CodeGen/AMDGPU/permute_i8.ll        |   114 +-
 llvm/test/CodeGen/AMDGPU/pr51516.mir          |     6 +-
 .../AMDGPU/promote-constOffset-to-imm.ll      |   146 +-
 llvm/test/CodeGen/AMDGPU/repeated-divisor.ll  |     4 +-
 llvm/test/CodeGen/AMDGPU/sdiv.ll              |   192 +-
 llvm/test/CodeGen/AMDGPU/select.f16.ll        |   341 +-
 llvm/test/CodeGen/AMDGPU/shl.ll               |    10 +-
 .../AMDGPU/splitkit-getsubrangeformask.ll     |    14 +-
 llvm/test/CodeGen/AMDGPU/sra.ll               |    20 +-
 llvm/test/CodeGen/AMDGPU/srem.ll              |    12 +-
 llvm/test/CodeGen/AMDGPU/srl.ll               |    10 +-
 llvm/test/CodeGen/AMDGPU/store-local.128.ll   |    57 +-
 llvm/test/CodeGen/AMDGPU/store-local.96.ll    |    29 +-
 llvm/test/CodeGen/AMDGPU/sub.ll               |    30 +-
 llvm/test/CodeGen/AMDGPU/udivrem.ll           |     8 +-
 llvm/test/CodeGen/PowerPC/p10-fi-elim.ll      |     4 +-
 ...lar-shift-by-byte-multiple-legalization.ll |    68 +-
 llvm/test/CodeGen/RISCV/abds-neg.ll           |    60 +-
 llvm/test/CodeGen/RISCV/abds.ll               |   800 +-
 llvm/test/CodeGen/RISCV/abdu-neg.ll           |    52 +-
 llvm/test/CodeGen/RISCV/add-before-shl.ll     |    20 +-
 llvm/test/CodeGen/RISCV/fold-mem-offset.ll    |    16 +-
 llvm/test/CodeGen/RISCV/legalize-fneg.ll      |    10 +-
 llvm/test/CodeGen/RISCV/memcmp-optsize.ll     |    84 +-
 llvm/test/CodeGen/RISCV/memcmp.ll             |    84 +-
 llvm/test/CodeGen/RISCV/rv32zbb.ll            |     2 +-
 .../CodeGen/RISCV/rvv/fixed-vectors-elen.ll   |    34 +-
 .../RISCV/rvv/fixed-vectors-int-buildvec.ll   |   296 +-
 .../RISCV/rvv/fixed-vectors-masked-gather.ll  |    10 +-
 llvm/test/CodeGen/RISCV/rvv/pr125306.ll       |    16 +-
 llvm/test/CodeGen/RISCV/scmp.ll               |     2 +-
 llvm/test/CodeGen/RISCV/srem-vector-lkk.ll    |    48 +-
 llvm/test/CodeGen/RISCV/ucmp.ll               |     2 +-
 .../CodeGen/RISCV/unaligned-load-store.ll     |    32 +-
 llvm/test/CodeGen/RISCV/urem-vector-lkk.ll    |    36 +-
 llvm/test/CodeGen/RISCV/vararg.ll             |    18 +-
 ...lar-shift-by-byte-multiple-legalization.ll |   718 +-
 .../RISCV/wide-scalar-shift-legalization.ll   |   402 +-
 llvm/test/CodeGen/RISCV/xtheadmempair.ll      |    14 +-
 176 files changed, 31670 insertions(+), 31540 deletions(-)

diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index bc00d0b4ff852..14f3fda90ef6d 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -303,10 +303,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// The bottom of the unscheduled zone.
   MachineBasicBlock::iterator CurrentBottom;
 
-  /// Record the next node in a scheduled cluster.
-  const SUnit *NextClusterPred = nullptr;
-  const SUnit *NextClusterSucc = nullptr;
-
 #if LLVM_ENABLE_ABI_BREAKING_CHECKS
   /// The number of instructions scheduled so far. Used to cut off the
   /// scheduler at the point determined by misched-cutoff.
@@ -367,10 +363,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// live ranges and region boundary iterators.
   void moveInstruction(MachineInstr *MI, MachineBasicBlock::iterator InsertPos);
 
-  const SUnit *getNextClusterPred() const { return NextClusterPred; }
-
-  const SUnit *getNextClusterSucc() const { return NextClusterSucc; }
-
   void viewGraph(const Twine &Name, const Twine &Title) override;
   void viewGraph() override;
 
@@ -1292,6 +1284,9 @@ class GenericScheduler : public GenericSchedulerBase {
   SchedBoundary Top;
   SchedBoundary Bot;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
   /// Candidate last picked from Top boundary.
   SchedCandidate TopCand;
   /// Candidate last picked from Bot boundary.
@@ -1332,6 +1327,9 @@ class PostGenericScheduler : public GenericSchedulerBase {
   /// Candidate last picked from Bot boundary.
   SchedCandidate BotCand;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
 public:
   PostGenericScheduler(const MachineSchedContext *C)
       : GenericSchedulerBase(C), Top(SchedBoundary::TopQID, "TopQ"),
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAG.h b/llvm/include/llvm/CodeGen/ScheduleDAG.h
index 1c8d92d149adc..a4301d11a4454 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAG.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAG.h
@@ -17,6 +17,7 @@
 
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/PointerIntPair.h"
+#include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/CodeGen/MachineInstr.h"
@@ -234,6 +235,10 @@ class TargetRegisterInfo;
     void dump(const TargetRegisterInfo *TRI = nullptr) const;
   };
 
+  /// Keep record of which SUnit are in the same cluster group.
+  typedef SmallSet<SUnit *, 8> ClusterInfo;
+  constexpr unsigned InvalidClusterId = ~0u;
+
   /// Scheduling unit. This is a node in the scheduling DAG.
   class SUnit {
   private:
@@ -274,6 +279,8 @@ class TargetRegisterInfo;
     unsigned TopReadyCycle = 0; ///< Cycle relative to start when node is ready.
     unsigned BotReadyCycle = 0; ///< Cycle relative to end when node is ready.
 
+    unsigned ParentClusterIdx = InvalidClusterId; ///< The parent cluster id.
+
   private:
     unsigned Depth = 0;  ///< Node depth.
     unsigned Height = 0; ///< Node height.
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
index e79b03c57a1e8..6c6bd8015ee69 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
@@ -180,6 +180,8 @@ namespace llvm {
     /// case of a huge region that gets reduced).
     SUnit *BarrierChain = nullptr;
 
+    SmallVector<ClusterInfo> Clusters;
+
   public:
     /// A list of SUnits, used in Value2SUsMap, during DAG construction.
     /// Note: to gain speed it might be worth investigating an optimized
@@ -383,6 +385,14 @@ namespace llvm {
     /// equivalent edge already existed (false indicates failure).
     bool addEdge(SUnit *SuccSU, const SDep &PredDep);
 
+    /// Returns the array of the clusters.
+    SmallVector<ClusterInfo> &getClusters() { return Clusters; }
+
+    /// Get the specific cluster, return nullptr for InvalidClusterId.
+    ClusterInfo *getCluster(unsigned Idx) {
+      return Idx != InvalidClusterId ? &Clusters[Idx] : nullptr;
+    }
+
   protected:
     void initSUnits();
     void addPhysRegDataDeps(SUnit *SU, unsigned OperIdx);
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0c3ffb1bbaa6f..91da22612eac6 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -15,6 +15,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
 #include "llvm/ADT/PriorityQueue.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
@@ -844,8 +845,6 @@ void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
 
   if (SuccEdge->isWeak()) {
     --SuccSU->WeakPredsLeft;
-    if (SuccEdge->isCluster())
-      NextClusterSucc = SuccSU;
     return;
   }
 #ifndef NDEBUG
@@ -881,8 +880,6 @@ void ScheduleDAGMI::releasePred(SUnit *SU, SDep *PredEdge) {
 
   if (PredEdge->isWeak()) {
     --PredSU->WeakSuccsLeft;
-    if (PredEdge->isCluster())
-      NextClusterPred = PredSU;
     return;
   }
 #ifndef NDEBUG
@@ -1077,11 +1074,8 @@ findRootsAndBiasEdges(SmallVectorImpl<SUnit*> &TopRoots,
 }
 
 /// Identify DAG roots and setup scheduler queues.
-void ScheduleDAGMI::initQueues(ArrayRef<SUnit*> TopRoots,
-                               ArrayRef<SUnit*> BotRoots) {
-  NextClusterSucc = nullptr;
-  NextClusterPred = nullptr;
-
+void ScheduleDAGMI::initQueues(ArrayRef<SUnit *> TopRoots,
+                               ArrayRef<SUnit *> BotRoots) {
   // Release all DAG roots for scheduling, not including EntrySU/ExitSU.
   //
   // Nodes with unreleased weak edges can still be roots.
@@ -2008,6 +2002,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     ScheduleDAGInstrs *DAG) {
   // Keep track of the current cluster length and bytes for each SUnit.
   DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
+  EquivalenceClasses<SUnit *> Clusters;
 
   // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
   // cluster mem ops collected within `MemOpRecords` array.
@@ -2047,6 +2042,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
 
     SUnit *SUa = MemOpa.SU;
     SUnit *SUb = MemOpb.SU;
+
     if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
@@ -2054,6 +2050,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
       continue;
 
+    Clusters.unionSets(SUa, SUb);
     LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                       << SUb->NodeNum << ")\n");
     ++NumClustered;
@@ -2093,6 +2090,21 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
                       << ", Curr cluster bytes: " << CurrentClusterBytes
                       << "\n");
   }
+
+  // Add cluster group information.
+  // Iterate over all of the equivalence sets.
+  auto &AllClusters = DAG->getClusters();
+  for (auto &I : Clusters) {
+    if (!I->isLeader())
+      continue;
+    ClusterInfo Group;
+    unsigned ClusterIdx = AllClusters.size();
+    for (auto *MemberI : Clusters.members(*I)) {
+      MemberI->ParentClusterIdx = ClusterIdx;
+      Group.insert(MemberI);
+    }
+    AllClusters.push_back(Group);
+  }
 }
 
 void BaseMemOpClusterMutation::collectMemOpRecords(
@@ -3456,6 +3468,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
   }
   TopCand.SU = nullptr;
   BotCand.SU = nullptr;
+
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 /// Initialize the per-region scheduling policy.
@@ -3762,13 +3777,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-    Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-    TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU,
-                 TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -4015,11 +4028,25 @@ void GenericScheduler::reschedulePhysReg(SUnit *SU, bool isTop) {
 void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (TopCluster) {
+      dbgs() << "  Top Cluster: ";
+      for (auto *N : *TopCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Top.bumpNode(SU);
     if (SU->hasPhysRegUses)
       reschedulePhysReg(SU, true);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (BotCluster) {
+      dbgs() << "  Bot Cluster: ";
+      for (auto *N : *BotCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Bot.bumpNode(SU);
     if (SU->hasPhysRegDefs)
       reschedulePhysReg(SU, false);
@@ -4076,6 +4103,8 @@ void PostGenericScheduler::initialize(ScheduleDAGMI *Dag) {
   if (!Bot.HazardRec) {
     Bot.HazardRec = DAG->TII->CreateTargetMIHazardRecognizer(Itin, DAG);
   }
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 void PostGenericScheduler::initPolicy(MachineBasicBlock::iterator Begin,
@@ -4137,14 +4166,12 @@ bool PostGenericScheduler::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
-
   // Avoid critical resource consumption and balance the schedule.
   if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
               TryCand, Cand, ResourceReduce))
@@ -4329,9 +4356,11 @@ SUnit *PostGenericScheduler::pickNode(bool &IsTopNode) {
 void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
     Top.bumpNode(SU);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
     Bot.bumpNode(SU);
   }
 }
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 5bd6ca0978a4b..c614e477a9d8f 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   for (SDep &SI : SecondSU.Preds)
     if (SI.isCluster())
       return false;
+
+  unsigned FirstCluster = FirstSU.ParentClusterIdx;
+  unsigned SecondCluster = SecondSU.ParentClusterIdx;
+  assert(FirstCluster == InvalidClusterId && SecondCluster == InvalidClusterId);
+
   // Though the reachability checks above could be made more generic,
   // perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
   // the extra computation cost makes it less interesting in general cases.
@@ -70,6 +75,14 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
     return false;
 
+  auto &Clusters = DAG.getClusters();
+
+  FirstSU.ParentClusterIdx = Clusters.size();
+  SecondSU.ParentClusterIdx = Clusters.size();
+
+  SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+  Clusters.emplace_back(Cluster);
+
   // TODO - If we want to chain more than two instructions, we need to create
   // artifical edges to make dependencies from the FirstSU also dependent
   // on other chained instructions, and other chained instructions also
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..e630b80e33ab4 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -365,6 +365,9 @@ LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeName(const SUnit &SU) const {
 LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeAll(const SUnit &SU) const {
   dumpNode(SU);
   SU.dumpAttributes();
+  if (SU.ParentClusterIdx != InvalidClusterId)
+    dbgs() << "  Parent Cluster Index: " << SU.ParentClusterIdx << '\n';
+
   if (SU.Preds.size() > 0) {
     dbgs() << "  Predecessors:\n";
     for (const SDep &Dep : SU.Preds) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 5678512748569..6c6c81ab2b4cc 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -584,12 +584,11 @@ bool GCNMaxILPSchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid increasing the max critical pressure in the scheduled region.
@@ -659,12 +658,11 @@ bool GCNMaxMemoryClauseSchedStrategy::tryCandidate(SchedCandidate &Cand,
 
   // MaxMemoryClause-specific: We prioritize clustered instructions as we would
   // get more benefit from clausing these memory instructions.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // We only compare a subset of features when comparing nodes between
diff --git a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
index 03712879f7c49..5eb1f0128643d 100644
--- a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
+++ b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
@@ -100,12 +100,11 @@ bool PPCPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -190,8 +189,11 @@ bool PPCPostRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  if (tryGreater(TryCand.SU == DAG->getNextClusterSucc(),
-                 Cand.SU == DAG->getNextClusterSucc(), TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid critical resource consumption and balance the schedule.
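
Reviewer note on the `tryCandidate` hunks above: every target now applies the same pattern, asking whether a candidate is a member of the active cluster rather than whether it equals a single "next" node. The following is a minimal, standalone sketch of that membership test. `Node`, `ClusterSet`, `inCluster` and `preferForCluster` are hypothetical stand-ins introduced for illustration only; they are not LLVM's `SUnit`, `ClusterInfo` or `tryGreater`, and the real `tryGreater` also records the `Cluster` reason on the winning candidate, which this sketch omits.

```cpp
#include <unordered_set>

// Hypothetical stand-in for a scheduling unit; not llvm::SUnit.
struct Node {
  unsigned NodeNum;
};

// Hypothetical stand-in for the patch's ClusterInfo: an unordered set of nodes.
using ClusterSet = std::unordered_set<const Node *>;

// A null cluster means no cluster is currently being followed, so nothing
// counts as a member. This mirrors the `Ptr && Ptr->contains(SU)` guard.
bool inCluster(const ClusterSet *Cluster, const Node *SU) {
  return Cluster != nullptr && Cluster->count(SU) != 0;
}

// Simplified view of the comparison: the new candidate wins on clustering
// grounds only if it is in its cluster and the current candidate is not.
bool preferForCluster(const ClusterSet *TryCluster, const Node *TrySU,
                      const ClusterSet *CandCluster, const Node *CandSU) {
  return inCluster(TryCluster, TrySU) && !inCluster(CandCluster, CandSU);
}
```

With a cluster {A, B} and an unrelated node C, `preferForCluster` favours B over C regardless of whether A or B was picked first, which is the property the ordered `NextClusterSucc`/`NextClusterPred` pointers could not provide. When both candidates are members, or neither is, the sketch reports no preference and the decision would fall through to later heuristics.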
diff --git a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
index b944194dae8fc..f9176bc9d3fa5 100644
--- a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
+++ b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
@@ -477,9 +477,8 @@ define void @callee_in_memory(%T_IN_MEMORY %a) {
 ; CHECK-NEXT:    add x8, x8, :lo12:in_memory_store
 ; CHECK-NEXT:    ldr d0, [sp, #64]
 ; CHECK-NEXT:    str d0, [x8, #64]
-; CHECK-NEXT:    ldr q0, [sp, #16]
 ; CHECK-NEXT:    str q2, [x8, #48]
-; CHECK-NEXT:    ldr q2, [sp]
+; CHECK-NEXT:    ldp q2, q0, [sp]
 ; CHECK-NEXT:    stp q0, q1, [x8, #16]
 ; CHECK-NEXT:    str q2, [x8]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
index 7e72e8de01f4f..3bada9d5b3bb4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
@@ -7,8 +7,8 @@
 
 ; CHECK-LABEL: @test
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -36,8 +36,8 @@ entry:
 
 ; CHECK-LABEL: @test_int
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -65,8 +65,8 @@ entry:
 
 ; CHECK-LABEL: @test_long
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #4
-; CHECK: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
+; CHECK-DAG: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
 ; CHECK: add {{x[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{x[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
diff --git a/llvm/test/CodeGen/AArch64/bcmp.ll b/llvm/test/CodeGen/AArch64/bcmp.ll
index fee52ead98962..e70ddc3415cac 100644
--- a/llvm/test/CodeGen/AArch64/bcmp.ll
+++ b/llvm/test/CodeGen/AArch64/bcmp.ll
@@ -494,13 +494,14 @@ define i1 @bcmp_i128(i128 %a0, i128 %b0, i128 %a1, i128 %b1, i128 %a2, i128 %b2)
 ; CHECK-LABEL: bcmp_i128:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    cmp x2, x0
-; CHECK-NEXT:    ldp x8, x10, [sp]
+; CHECK-NEXT:    ldp x10, x8, [sp, #8]
 ; CHECK-NEXT:    ccmp x3, x1, #0, eq
-; CHECK-NEXT:    ldp x9, x11, [sp, #16]
+; CHECK-NEXT:    ldr x9, [sp]
+; CHECK-NEXT:    ldr x11, [sp, #24]
 ; CHECK-NEXT:    ccmp x6, x4, #0, eq
 ; CHECK-NEXT:    ccmp x7, x5, #0, eq
 ; CHECK-NEXT:    cset w12, ne
-; CHECK-NEXT:    cmp x9, x8
+; CHECK-NEXT:    cmp x8, x9
 ; CHECK-NEXT:    ccmp x11, x10, #0, eq
 ; CHECK-NEXT:    csinc w0, w12, wzr, eq
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/expand-select.ll b/llvm/test/CodeGen/AArch64/expand-select.ll
index 1ed2e09c6b4d4..7ca6adb1338d3 100644
--- a/llvm/test/CodeGen/AArch64/expand-select.ll
+++ b/llvm/test/CodeGen/AArch64/expand-select.ll
@@ -8,11 +8,11 @@ define void @foo(i32 %In1, <2 x i128> %In2, <2 x i128> %In3, ptr %Out) {
 ; CHECK-NEXT:    fmov s0, wzr
 ; CHECK-NEXT:    ldr x11, [sp]
 ; CHECK-NEXT:    fmov s1, w8
-; CHECK-NEXT:    ldp x9, x10, [sp, #8]
+; CHECK-NEXT:    ldp x8, x10, [sp, #8]
 ; CHECK-NEXT:    cmeq v0.4s, v1.4s, v0.4s
-; CHECK-NEXT:    fmov w8, s0
-; CHECK-NEXT:    tst w8, #0x1
-; CHECK-NEXT:    csel x8, x5, x9, ne
+; CHECK-NEXT:    fmov w9, s0
+; CHECK-NEXT:    tst w9, #0x1
+; CHECK-NEXT:    csel x8, x5, x8, ne
 ; CHECK-NEXT:    csel x9, x4, x11, ne
 ; CHECK-NEXT:    stp x9, x8, [x10, #16]
 ; CHECK-NEXT:    csel x8, x3, x7, ne
@@ -36,14 +36,14 @@ define void @bar(i32 %In1, <2 x i96> %In2, <2 x i96> %In3, ptr %Out) {
 ; CHECK-NEXT:    ldr x10, [sp, #16]
 ; CHECK-NEXT:    fmov s1, w8
 ; CHECK-NEXT:    cmeq v0.4s, v1.4s, v0.4s
-; CHECK-NEXT:    fmov w8, s0
-; CHECK-NEXT:    tst w8, #0x1
-; CHECK-NEXT:    ldp x9, x8, [sp]
+; CHECK-NEXT:    fmov w9, s0
+; CHECK-NEXT:    tst w9, #0x1
+; CHECK-NEXT:    ldp x8, x9, [sp]
 ; CHECK-NEXT:    csel x11, x2, x6, ne
 ; CHECK-NEXT:    str x11, [x10]
-; CHECK-NEXT:    csel x9, x4, x9, ne
-; CHECK-NEXT:    csel x8, x5, x8, ne
-; CHECK-NEXT:    stur x9, [x10, #12]
+; CHECK-NEXT:    csel x8, x4, x8, ne
+; CHECK-NEXT:    stur x8, [x10, #12]
+; CHECK-NEXT:    csel x8, x5, x9, ne
 ; CHECK-NEXT:    csel x9, x3, x7, ne
 ; CHECK-NEXT:    str w8, [x10, #20]
 ; CHECK-NEXT:    str w9, [x10, #8]
diff --git a/llvm/test/CodeGen/AArch64/extbinopload.ll b/llvm/test/CodeGen/AArch64/extbinopload.ll
index 82114d60c4a93..cabb0e7278e40 100644
--- a/llvm/test/CodeGen/AArch64/extbinopload.ll
+++ b/llvm/test/CodeGen/AArch64/extbinopload.ll
@@ -667,30 +667,30 @@ define <16 x i32> @extrause_load(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-NEXT:    add x10, x3, #12
 ; CHECK-NEXT:    bic v1.8h, #255, lsl #8
 ; CHECK-NEXT:    ld1 { v0.s }[3], [x3], #4
-; CHECK-NEXT:    ldr s3, [x0, #12]
-; CHECK-NEXT:    ldp s2, s7, [x0, #4]
+; CHECK-NEXT:    ldr s4, [x0, #12]
+; CHECK-NEXT:    ldp s5, s2, [x2, #4]
 ; CHECK-NEXT:    ldr s6, [x2, #12]
-; CHECK-NEXT:    ldp s5, s4, [x2, #4]
-; CHECK-NEXT:    ld1 { v3.s }[1], [x11]
+; CHECK-NEXT:    ldp s3, s7, [x0, #4]
+; CHECK-NEXT:    ld1 { v4.s }[1], [x11]
 ; CHECK-NEXT:    ld1 { v6.s }[1], [x10]
-; CHECK-NEXT:    ld1 { v2.s }[1], [x9]
-; CHECK-NEXT:    ld1 { v4.s }[1], [x8]
+; CHECK-NEXT:    ld1 { v2.s }[1], [x8]
 ; CHECK-NEXT:    ld1 { v5.s }[1], [x3]
 ; CHECK-NEXT:    add x8, x1, #8
+; CHECK-NEXT:    ld1 { v3.s }[1], [x9]
 ; CHECK-NEXT:    ld1 { v7.s }[1], [x8]
-; CHECK-NEXT:    uaddl v2.8h, v2.8b, v3.8b
-; CHECK-NEXT:    ushll v4.8h, v4.8b, #0
-; CHECK-NEXT:    uaddl v3.8h, v5.8b, v6.8b
+; CHECK-NEXT:    ushll v2.8h, v2.8b, #0
+; CHECK-NEXT:    uaddl v3.8h, v3.8b, v4.8b
+; CHECK-NEXT:    uaddl v4.8h, v5.8b, v6.8b
 ; CHECK-NEXT:    uaddw v1.8h, v1.8h, v7.8b
-; CHECK-NEXT:    uaddw2 v4.8h, v4.8h, v0.16b
-; CHECK-NEXT:    ushll v0.4s, v2.4h, #3
-; CHECK-NEXT:    ushll v5.4s, v3.4h, #3
+; CHECK-NEXT:    uaddw2 v2.8h, v2.8h, v0.16b
+; CHECK-NEXT:    ushll v0.4s, v3.4h, #3
+; CHECK-NEXT:    ushll v5.4s, v4.4h, #3
+; CHECK-NEXT:    ushll2 v4.4s, v4.8h, #3
 ; CHECK-NEXT:    ushll2 v3.4s, v3.8h, #3
-; CHECK-NEXT:    ushll2 v2.4s, v2.8h, #3
 ; CHECK-NEXT:    uaddw v0.4s, v0.4s, v1.4h
-; CHECK-NEXT:    uaddw2 v1.4s, v2.4s, v1.8h
-; CHECK-NEXT:    uaddw2 v3.4s, v3.4s, v4.8h
-; CHECK-NEXT:    uaddw v2.4s, v5.4s, v4.4h
+; CHECK-NEXT:    uaddw2 v1.4s, v3.4s, v1.8h
+; CHECK-NEXT:    uaddw2 v3.4s, v4.4s, v2.8h
+; CHECK-NEXT:    uaddw v2.4s, v5.4s, v2.4h
 ; CHECK-NEXT:    ret
   %lp1 = load <4 x i8>, ptr %p
   store <4 x i8> %lp1, ptr %z
@@ -861,7 +861,7 @@ define <16 x i32> @extrause_shuffle(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 define <16 x i32> @extrause_ext(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-LABEL: extrause_ext:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp s0, s3, [x2]
+; CHECK-NEXT:    ldp s0, s5, [x2]
 ; CHECK-NEXT:    add x8, x3, #8
 ; CHECK-NEXT:    add x9, x3, #12
 ; CHECK-NEXT:    add x10, x1, #8
@@ -871,26 +871,26 @@ define <16 x i32> @extrause_ext(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-NEXT:    ld1 { v1.s }[1], [x1], #4
 ; CHECK-NEXT:    ld1 { v2.s }[1], [x1]
 ; CHECK-NEXT:    ldp s7, s4, [x0, #8]
-; CHECK-NEXT:    ld1 { v3.s }[1], [x3]
-; CHECK-NEXT:    ldp s6, s5, [x2, #8]
+; CHECK-NEXT:    ld1 { v5.s }[1], [x3]
+; CHECK-NEXT:    ldp s6, s3, [x2, #8]
 ; CHECK-NEXT:    ld1 { v4.s }[1], [x11]
 ; CHECK-NEXT:    ld1 { v7.s }[1], [x10]
-; CHECK-NEXT:    ld1 { v5.s }[1], [x9]
+; CHECK-NEXT:    ld1 { v3.s }[1], [x9]
 ; CHECK-NEXT:    ld1 { v6.s }[1], [x8]
 ; CHECK-NEXT:    uaddl v2.8h, v2.8b, v4.8b
 ; CHECK-NEXT:    uaddl v1.8h, v1.8b, v7.8b
 ; CHECK-NEXT:    ushll v4.8h, v4.8b, #0
-; CHECK-NEXT:    uaddl v3.8h, v3.8b, v5.8b
+; CHECK-NEXT:    uaddl v5.8h, v5.8b, v3.8b
 ; CHECK-NEXT:    uaddl v6.8h, v0.8b, v6.8b
-; CHECK-NEXT:    ushll v5.8h, v5.8b, #0
+; CHECK-NEXT:    ushll v16.8h, v3.8b, #0
 ; CHECK-NEXT:    ushll v0.4s, v2.4h, #3
 ; CHECK-NEXT:    ushll2 v2.4s, v2.8h, #3
-; CHECK-NEXT:    ushll v7.4s, v3.4h, #3
-; CHECK-NEXT:    ushll2 v3.4s, v3.8h, #3
-; CHECK-NEXT:    stp q4, q5, [x4]
+; CHECK-NEXT:    ushll v7.4s, v5.4h, #3
+; CHECK-NEXT:    ushll2 v5.4s, v5.8h, #3
+; CHECK-NEXT:    stp q4, q16, [x4]
 ; CHECK-NEXT:    uaddw v0.4s, v0.4s, v1.4h
 ; CHECK-NEXT:    uaddw2 v1.4s, v2.4s, v1.8h
-; CHECK-NEXT:    uaddw2 v3.4s, v3.4s, v6.8h
+; CHECK-NEXT:    uaddw2 v3.4s, v5.4s, v6.8h
 ; CHECK-NEXT:    uaddw v2.4s, v7.4s, v6.4h
 ; CHECK-NEXT:    ret
   %lp1 = load <4 x i8>, ptr %p
@@ -960,7 +960,7 @@ define <16 x i32> @extrause_ext(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 define <16 x i32> @extrause_add(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-LABEL: extrause_add:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp s0, s4, [x2]
+; CHECK-NEXT:    ldp s0, s5, [x2]
 ; CHECK-NEXT:    add x8, x3, #8
 ; CHECK-NEXT:    add x9, x3, #12
 ; CHECK-NEXT:    add x10, x1, #8
@@ -970,15 +970,15 @@ define <16 x i32> @extrause_add(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-NEXT:    ld1 { v1.s }[1], [x1], #4
 ; CHECK-NEXT:    ld1 { v2.s }[1], [x1]
 ; CHECK-NEXT:    ldp s7, s3, [x0, #8]
-; CHECK-NEXT:    ld1 { v4.s }[1], [x3]
-; CHECK-NEXT:    ldp s6, s5, [x2, #8]
+; CHECK-NEXT:    ld1 { v5.s }[1], [x3]
+; CHECK-NEXT:    ldp s6, s4, [x2, #8]
 ; CHECK-NEXT:    ld1 { v3.s }[1], [x11]
 ; CHECK-NEXT:    ld1 { v7.s }[1], [x10]
-; CHECK-NEXT:    ld1 { v5.s }[1], [x9]
+; CHECK-NEXT:    ld1 { v4.s }[1], [x9]
 ; CHECK-NEXT:    ld1 { v6.s }[1], [x8]
 ; CHECK-NEXT:    uaddl v16.8h, v2.8b, v3.8b
 ; CHECK-NEXT:    uaddl v1.8h, v1.8b, v7.8b
-; CHECK-NEXT:    uaddl v4.8h, v4.8b, v5.8b
+; CHECK-NEXT:    uaddl v4.8h, v5.8b, v4.8b
 ; CHECK-NEXT:    uaddl v2.8h, v0.8b, v6.8b
 ; CHECK-NEXT:    ushll v0.4s, v16.4h, #3
 ; CHECK-NEXT:    ushll2 v6.4s, v16.8h, #3
@@ -1073,24 +1073,24 @@ define <16 x i32> @extrause_ext2(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-NEXT:    ld1 { v6.s }[1], [x10]
 ; CHECK-NEXT:    ld1 { v5.s }[1], [x9]
 ; CHECK-NEXT:    ld1 { v7.s }[1], [x8]
-; CHECK-NEXT:    uaddl v16.8h, v2.8b, v3.8b
-; CHECK-NEXT:    uaddl v3.8h, v1.8b, v6.8b
-; CHECK-NEXT:    uaddl v2.8h, v4.8b, v5.8b
+; CHECK-NEXT:    uaddl v2.8h, v2.8b, v3.8b
+; CHECK-NEXT:    uaddl v1.8h, v1.8b, v6.8b
+; CHECK-NEXT:    uaddl v3.8h, v4.8b, v5.8b
 ; CHECK-NEXT:    uaddl v4.8h, v0.8b, v7.8b
-; CHECK-NEXT:    ushll v0.4s, v16.4h, #3
-; CHECK-NEXT:    ushll2 v1.4s, v16.8h, #3
-; CHECK-NEXT:    ushll2 v18.4s, v16.8h, #0
-; CHECK-NEXT:    ushll v6.4s, v2.4h, #3
-; CHECK-NEXT:    ushll2 v7.4s, v2.8h, #3
-; CHECK-NEXT:    ushll2 v5.4s, v2.8h, #0
+; CHECK-NEXT:    ushll2 v0.4s, v2.8h, #0
+; CHECK-NEXT:    ushll v5.4s, v2.4h, #3
+; CHECK-NEXT:    ushll2 v16.4s, v2.8h, #3
+; CHECK-NEXT:    ushll v6.4s, v3.4h, #3
+; CHECK-NEXT:    ushll2 v7.4s, v3.8h, #3
 ; CHECK-NEXT:    ushll v17.4s, v2.4h, #0
-; CHECK-NEXT:    uaddw2 v1.4s, v1.4s, v3.8h
-; CHECK-NEXT:    uaddw v0.4s, v0.4s, v3.4h
+; CHECK-NEXT:    ushll2 v18.4s, v3.8h, #0
+; CHECK-NEXT:    ushll v19.4s, v3.4h, #0
+; CHECK-NEXT:    stp q17, q0, [x4]
+; CHECK-NEXT:    uaddw v0.4s, v5.4s, v1.4h
+; CHECK-NEXT:    uaddw2 v1.4s, v16.4s, v1.8h
 ; CHECK-NEXT:    uaddw2 v3.4s, v7.4s, v4.8h
 ; CHECK-NEXT:    uaddw v2.4s, v6.4s, v4.4h
-; CHECK-NEXT:    ushll v4.4s, v16.4h, #0
-; CHECK-NEXT:    stp q17, q5, [x4, #32]
-; CHECK-NEXT:    stp q4, q18, [x4]
+; CHECK-NEXT:    stp q19, q18, [x4, #32]
 ; CHECK-NEXT:    ret
   %lp1 = load <4 x i8>, ptr %p
   %p2 = getelementptr i8, ptr %p, i32 4
@@ -1176,19 +1176,20 @@ define <16 x i32> @extrause_shl(ptr %p, ptr %q, ptr %r, ptr %s, ptr %z) {
 ; CHECK-NEXT:    ld1 { v5.s }[1], [x9]
 ; CHECK-NEXT:    ld1 { v7.s }[1], [x8]
 ; CHECK-NEXT:    uaddl v2.8h, v2.8b, v3.8b
+; CHECK-NEXT:    uaddl v1.8h, v1.8b, v6.8b
 ; CHECK-NEXT:    uaddl v3.8h, v4.8b, v5.8b
-; CHECK-NEXT:    uaddl v4.8h, v1.8b, v6.8b
-; CHECK-NEXT:    ushll v5.4s, v2.4h, #3
-; CHECK-NEXT:    ushll2 v6.4s, v2.8h, #3
-; CHECK-NEXT:    uaddl v2.8h, v0.8b, v7.8b
-; CHECK-NEXT:    ushll v7.4s, v3.4h, #3
-; CHECK-NEXT:    ushll2 v16.4s, v3.8h, #3
-; CHECK-NEXT:    uaddw2 v1.4s, v6.4s, v4.8h
-; CHECK-NEXT:    uaddw v0.4s, v5.4s, v4.4h
-; CHECK-NEXT:    stp q5, q6, [x4]
-; CHECK-NEXT:    uaddw2 v3.4s, v16.4s, v2.8h
-; CHECK-NEXT:    uaddw v2.4s, v7.4s, v2.4h
-; CHECK-NEXT:    stp q7, q16, [x4, #32]
+; CHECK-NEXT:    uaddl v5.8h, v0.8b, v7.8b
+; CHECK-NEXT:    ushll v4.4s, v2.4h, #3
+; CHECK-NEXT:    ushll2 v2.4s, v2.8h, #3
+; CHECK-NEXT:    ushll v6.4s, v3.4h, #3
+; CHECK-NEXT:    ushll2 v7.4s, v3.8h, #3
+; CHECK-NEXT:    uaddw v0.4s, v4.4s, v1.4h
+; CHECK-NEXT:    uaddw2 v1.4s, v2.4s, v1.8h
+; CHECK-NEXT:    str q4, [x4]
+; CHECK-NEXT:    stp q2, q6, [x4, #16]
+; CHECK-NEXT:    uaddw2 v3.4s, v7.4s, v5.8h
+; CHECK-NEXT:    uaddw v2.4s, v6.4s, v5.4h
+; CHECK-NEXT:    str q7, [x4, #48]
 ; CHECK-NEXT:    ret
   %lp1 = load <4 x i8>, ptr %p
   %p2 = getelementptr i8, ptr %p, i32 4
diff --git a/llvm/test/CodeGen/AArch64/fcmp.ll b/llvm/test/CodeGen/AArch64/fcmp.ll
index 66f26fc9d8597..b3dd47ce28bfb 100644
--- a/llvm/test/CodeGen/AArch64/fcmp.ll
+++ b/llvm/test/CodeGen/AArch64/fcmp.ll
@@ -2040,16 +2040,16 @@ define <16 x i32> @v16f16_i32(<16 x half> %a, <16 x half> %b, <16 x i32> %d, <16
 ; CHECK-SD-NOFP16-NEXT:    fcvt s17, h0
 ; CHECK-SD-NOFP16-NEXT:    csetm w18, mi
 ; CHECK-SD-NOFP16-NEXT:    fcmp s19, s18
-; CHECK-SD-NOFP16-NEXT:    fmov s18, w14
 ; CHECK-SD-NOFP16-NEXT:    fmov s19, w17
+; CHECK-SD-NOFP16-NEXT:    fmov s18, w14
 ; CHECK-SD-NOFP16-NEXT:    csetm w0, mi
 ; CHECK-SD-NOFP16-NEXT:    fcmp s3, s1
 ; CHECK-SD-NOFP16-NEXT:    mov h1, v2.h[2]
 ; CHECK-SD-NOFP16-NEXT:    mov h3, v0.h[2]
 ; CHECK-SD-NOFP16-NEXT:    mov h2, v2.h[3]
 ; CHECK-SD-NOFP16-NEXT:    mov h0, v0.h[3]
-; CHECK-SD-NOFP16-NEXT:    mov v18.h[1], w12
 ; CHECK-SD-NOFP16-NEXT:    mov v19.h[1], w16
+; CHECK-SD-NOFP16-NEXT:    mov v18.h[1], w12
 ; CHECK-SD-NOFP16-NEXT:    csetm w1, mi
 ; CHECK-SD-NOFP16-NEXT:    fcmp s17, s16
 ; CHECK-SD-NOFP16-NEXT:    fmov s16, w10
@@ -2057,34 +2057,34 @@ define <16 x i32> @v16f16_i32(<16 x half> %a, <16 x half> %b, <16 x i32> %d, <16
 ; CHECK-SD-NOFP16-NEXT:    fcvt s3, h3
 ; CHECK-SD-NOFP16-NEXT:    fcvt s2, h2
 ; CHECK-SD-NOFP16-NEXT:    fcvt s0, h0
-; CHECK-SD-NOFP16-NEXT:    csetm w2, mi
+; CHECK-SD-NOFP16-NEXT:    csetm w10, mi
 ; CHECK-SD-NOFP16-NEXT:    mov v16.h[1], w8
-; CHECK-SD-NOFP16-NEXT:    mov v18.h[2], w13
-; CHECK-SD-NOFP16-NEXT:    fmov s17, w2
 ; CHECK-SD-NOFP16-NEXT:    mov v19.h[2], w18
+; CHECK-SD-NOFP16-NEXT:    fmov s17, w10
+; CHECK-SD-NOFP16-NEXT:    mov v18.h[2], w13
 ; CHECK-SD-NOFP16-NEXT:    fcmp s3, s1
 ; CHECK-SD-NOFP16-NEXT:    mov v17.h[1], w1
 ; CHECK-SD-NOFP16-NEXT:    mov v16.h[2], w9
-; CHECK-SD-NOFP16-NEXT:    mov v18.h[3], w15
 ; CHECK-SD-NOFP16-NEXT:    mov v19.h[3], w0
+; CHECK-SD-NOFP16-NEXT:    mov v18.h[3], w15
 ; CHECK-SD-NOFP16-NEXT:    csetm w8, mi
 ; CHECK-SD-NOFP16-NEXT:    fcmp s0, s2
 ; CHECK-SD-NOFP16-NEXT:    mov v17.h[2], w8
 ; CHECK-SD-NOFP16-NEXT:    mov v16.h[3], w11
 ; CHECK-SD-NOFP16-NEXT:    csetm w8, mi
+; CHECK-SD-NOFP16-NEXT:    sshll v2.4s, v18.4h, #0
 ; CHECK-SD-NOFP16-NEXT:    mov v17.h[3], w8
-; CHECK-SD-NOFP16-NEXT:    sshll v2.4s, v16.4h, #0
-; CHECK-SD-NOFP16-NEXT:    sshll v16.4s, v18.4h, #0
-; CHECK-SD-NOFP16-NEXT:    ldp q0, q18, [sp]
-; CHECK-SD-NOFP16-NEXT:    sshll v1.4s, v17.4h, #0
-; CHECK-SD-NOFP16-NEXT:    sshll v17.4s, v19.4h, #0
-; CHECK-SD-NOFP16-NEXT:    ldp q19, q3, [sp, #32]
-; CHECK-SD-NOFP16-NEXT:    bit v0.16b, v4.16b, v1.16b
-; CHECK-SD-NOFP16-NEXT:    mov v1.16b, v17.16b
-; CHECK-SD-NOFP16-NEXT:    bit v3.16b, v7.16b, v2.16b
-; CHECK-SD-NOFP16-NEXT:    mov v2.16b, v16.16b
-; CHECK-SD-NOFP16-NEXT:    bsl v1.16b, v5.16b, v18.16b
+; CHECK-SD-NOFP16-NEXT:    sshll v1.4s, v16.4h, #0
+; CHECK-SD-NOFP16-NEXT:    sshll v16.4s, v19.4h, #0
+; CHECK-SD-NOFP16-NEXT:    ldp q19, q18, [sp, #32]
+; CHECK-SD-NOFP16-NEXT:    sshll v0.4s, v17.4h, #0
+; CHECK-SD-NOFP16-NEXT:    ldp q3, q17, [sp]
 ; CHECK-SD-NOFP16-NEXT:    bsl v2.16b, v6.16b, v19.16b
+; CHECK-SD-NOFP16-NEXT:    bsl v0.16b, v4.16b, v3.16b
+; CHECK-SD-NOFP16-NEXT:    mov v3.16b, v1.16b
+; CHECK-SD-NOFP16-NEXT:    mov v1.16b, v16.16b
+; CHECK-SD-NOFP16-NEXT:    bsl v3.16b, v7.16b, v18.16b
+; CHECK-SD-NOFP16-NEXT:    bsl v1.16b, v5.16b, v17.16b
 ; CHECK-SD-NOFP16-NEXT:    ret
 ;
 ; CHECK-SD-FP16-LABEL: v16f16_i32:
diff --git a/llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll b/llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll
index d9d80f1cb50ee..1fbca7ca2c27c 100644
--- a/llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll
+++ b/llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll
@@ -118,12 +118,12 @@ define void @fptoui_2x_v8f32_to_v8i8_in_loop(ptr %A, ptr %B, ptr %dst) {
 ; CHECK-NEXT:    add x10, x0, x9
 ; CHECK-NEXT:    add x9, x1, x9
 ; CHECK-NEXT:    ldp q2, q1, [x10]
-; CHECK-NEXT:    fcvtzu.4s v5, v1
-; CHECK-NEXT:    ldp q1, q3, [x9]
-; CHECK-NEXT:    fcvtzu.4s v4, v2
-; CHECK-NEXT:    fcvtzu.4s v7, v3
+; CHECK-NEXT:    fcvtzu.4s v4, v1
+; CHECK-NEXT:    ldp q7, q1, [x9]
+; CHECK-NEXT:    fcvtzu.4s v3, v2
 ; CHECK-NEXT:    fcvtzu.4s v6, v1
-; CHECK-NEXT:    tbl.16b v1, { v4, v5, v6, v7 }, v0
+; CHECK-NEXT:    fcvtzu.4s v5, v7
+; CHECK-NEXT:    tbl.16b v1, { v3, v4, v5, v6 }, v0
 ; CHECK-NEXT:    str q1, [x2, x8, lsl #4]
 ; CHECK-NEXT:    add x8, x8, #1
 ; CHECK-NEXT:    cmp x8, #1000
@@ -185,12 +185,12 @@ define void @fptoui_2x_v8f32_to_v8i8_in_loop_no_concat_shuffle(ptr %A, ptr %B, p
 ; CHECK-NEXT:    add x10, x0, x9
 ; CHECK-NEXT:    add x9, x1, x9
 ; CHECK-NEXT:    ldp q2, q1, [x10]
-; CHECK-NEXT:    fcvtzu.4s v5, v1
-; CHECK-NEXT:    ldp q1, q3, [x9]
-; CHECK-NEXT:    fcvtzu.4s v4, v2
-; CHECK-NEXT:    fcvtzu.4s v7, v3
+; CHECK-NEXT:    fcvtzu.4s v4, v1
+; CHECK-NEXT:    ldp q7, q1, [x9]
+; CHECK-NEXT:    fcvtzu.4s v3, v2
 ; CHECK-NEXT:    fcvtzu.4s v6, v1
-; CHECK-NEXT:    tbl.16b v1, { v4, v5, v6, v7 }, v0
+; CHECK-NEXT:    fcvtzu.4s v5, v7
+; CHECK-NEXT:    tbl.16b v1, { v3, v4, v5, v6 }, v0
 ; CHECK-NEXT:    str q1, [x2, x8, lsl #4]
 ; CHECK-NEXT:    add x8, x8, #1
 ; CHECK-NEXT:    cmp x8, #1000
@@ -252,12 +252,12 @@ define void @fptoui_v16f32_to_v16i8_in_loop(ptr %A, ptr %dst) {
 ; CHECK-NEXT:    add x8, x8, #1
 ; CHECK-NEXT:    cmp x8, #1000
 ; CHECK-NEXT:    ldp q2, q1, [x9, #32]
-; CHECK-NEXT:    fcvtzu.4s v7, v1
-; CHECK-NEXT:    ldp q1, q3, [x9]
-; CHECK-NEXT:    fcvtzu.4s v6, v2
-; CHECK-NEXT:    fcvtzu.4s v5, v3
+; CHECK-NEXT:    fcvtzu.4s v6, v1
+; CHECK-NEXT:    ldp q7, q1, [x9]
+; CHECK-NEXT:    fcvtzu.4s v5, v2
 ; CHECK-NEXT:    fcvtzu.4s v4, v1
-; CHECK-NEXT:    tbl.16b v1, { v4, v5, v6, v7 }, v0
+; CHECK-NEXT:    fcvtzu.4s v3, v7
+; CHECK-NEXT:    tbl.16b v1, { v3, v4, v5, v6 }, v0
 ; CHECK-NEXT:    str q1, [x1], #32
 ; CHECK-NEXT:    b.eq LBB4_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
diff --git a/llvm/test/CodeGen/AArch64/fptoi.ll b/llvm/test/CodeGen/AArch64/fptoi.ll
index 9c4f0207b84ce..ae3b6a54a1f7f 100644
--- a/llvm/test/CodeGen/AArch64/fptoi.ll
+++ b/llvm/test/CodeGen/AArch64/fptoi.ll
@@ -2825,42 +2825,42 @@ define <32 x i64> @fptos_v32f32_v32i64(<32 x float> %a) {
 ; CHECK-SD-NEXT:    fcvtl v7.2d, v7.2s
 ; CHECK-SD-NEXT:    fcvtl2 v17.2d, v6.4s
 ; CHECK-SD-NEXT:    fcvtl v6.2d, v6.2s
-; CHECK-SD-NEXT:    fcvtl2 v18.2d, v5.4s
-; CHECK-SD-NEXT:    fcvtl v5.2d, v5.2s
+; CHECK-SD-NEXT:    fcvtl2 v21.2d, v2.4s
+; CHECK-SD-NEXT:    fcvtl v2.2d, v2.2s
 ; CHECK-SD-NEXT:    fcvtl2 v19.2d, v4.4s
 ; CHECK-SD-NEXT:    fcvtl v4.2d, v4.2s
+; CHECK-SD-NEXT:    fcvtl2 v18.2d, v5.4s
 ; CHECK-SD-NEXT:    fcvtl2 v20.2d, v3.4s
+; CHECK-SD-NEXT:    fcvtl v5.2d, v5.2s
 ; CHECK-SD-NEXT:    fcvtl v3.2d, v3.2s
 ; CHECK-SD-NEXT:    fcvtzs v16.2d, v16.2d
 ; CHECK-SD-NEXT:    fcvtzs v7.2d, v7.2d
 ; CHECK-SD-NEXT:    fcvtzs v17.2d, v17.2d
 ; CHECK-SD-NEXT:    fcvtzs v6.2d, v6.2d
+; CHECK-SD-NEXT:    fcvtzs v2.2d, v2.2d
+; CHECK-SD-NEXT:    fcvtzs v19.2d, v19.2d
+; CHECK-SD-NEXT:    fcvtzs v4.2d, v4.2d
 ; CHECK-SD-NEXT:    fcvtzs v18.2d, v18.2d
+; CHECK-SD-NEXT:    fcvtzs v20.2d, v20.2d
 ; CHECK-SD-NEXT:    fcvtzs v5.2d, v5.2d
-; CHECK-SD-NEXT:    fcvtzs v4.2d, v4.2d
 ; CHECK-SD-NEXT:    fcvtzs v3.2d, v3.2d
 ; CHECK-SD-NEXT:    stp q7, q16, [x8, #224]
-; CHECK-SD-NEXT:    fcvtl2 v7.2d, v2.4s
-; CHECK-SD-NEXT:    fcvtzs v16.2d, v19.2d
-; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
-; CHECK-SD-NEXT:    fcvtl v2.2d, v2.2s
-; CHECK-SD-NEXT:    fcvtl2 v5.2d, v0.4s
+; CHECK-SD-NEXT:    fcvtzs v16.2d, v21.2d
 ; CHECK-SD-NEXT:    stp q6, q17, [x8, #192]
-; CHECK-SD-NEXT:    fcvtl2 v6.2d, v1.4s
-; CHECK-SD-NEXT:    fcvtzs v17.2d, v20.2d
+; CHECK-SD-NEXT:    fcvtl2 v17.2d, v1.4s
 ; CHECK-SD-NEXT:    fcvtl v1.2d, v1.2s
+; CHECK-SD-NEXT:    stp q4, q19, [x8, #128]
+; CHECK-SD-NEXT:    stp q3, q20, [x8, #96]
+; CHECK-SD-NEXT:    stp q2, q16, [x8, #64]
+; CHECK-SD-NEXT:    fcvtl2 v16.2d, v0.4s
 ; CHECK-SD-NEXT:    fcvtl v0.2d, v0.2s
-; CHECK-SD-NEXT:    stp q4, q16, [x8, #128]
-; CHECK-SD-NEXT:    fcvtzs v7.2d, v7.2d
-; CHECK-SD-NEXT:    fcvtzs v2.2d, v2.2d
-; CHECK-SD-NEXT:    fcvtzs v4.2d, v6.2d
-; CHECK-SD-NEXT:    stp q3, q17, [x8, #96]
-; CHECK-SD-NEXT:    fcvtzs v3.2d, v5.2d
+; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
+; CHECK-SD-NEXT:    fcvtzs v6.2d, v17.2d
 ; CHECK-SD-NEXT:    fcvtzs v1.2d, v1.2d
+; CHECK-SD-NEXT:    fcvtzs v4.2d, v16.2d
 ; CHECK-SD-NEXT:    fcvtzs v0.2d, v0.2d
-; CHECK-SD-NEXT:    stp q2, q7, [x8, #64]
-; CHECK-SD-NEXT:    stp q0, q3, [x8]
-; CHECK-SD-NEXT:    stp q1, q4, [x8, #32]
+; CHECK-SD-NEXT:    stp q1, q6, [x8, #32]
+; CHECK-SD-NEXT:    stp q0, q4, [x8]
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: fptos_v32f32_v32i64:
@@ -2918,42 +2918,42 @@ define <32 x i64> @fptou_v32f32_v32i64(<32 x float> %a) {
 ; CHECK-SD-NEXT:    fcvtl v7.2d, v7.2s
 ; CHECK-SD-NEXT:    fcvtl2 v17.2d, v6.4s
 ; CHECK-SD-NEXT:    fcvtl v6.2d, v6.2s
-; CHECK-SD-NEXT:    fcvtl2 v18.2d, v5.4s
-; CHECK-SD-NEXT:    fcvtl v5.2d, v5.2s
+; CHECK-SD-NEXT:    fcvtl2 v21.2d, v2.4s
+; CHECK-SD-NEXT:    fcvtl v2.2d, v2.2s
 ; CHECK-SD-NEXT:    fcvtl2 v19.2d, v4.4s
 ; CHECK-SD-NEXT:    fcvtl v4.2d, v4.2s
+; CHECK-SD-NEXT:    fcvtl2 v18.2d, v5.4s
 ; CHECK-SD-NEXT:    fcvtl2 v20.2d, v3.4s
+; CHECK-SD-NEXT:    fcvtl v5.2d, v5.2s
 ; CHECK-SD-NEXT:    fcvtl v3.2d, v3.2s
 ; CHECK-SD-NEXT:    fcvtzu v16.2d, v16.2d
 ; CHECK-SD-NEXT:    fcvtzu v7.2d, v7.2d
 ; CHECK-SD-NEXT:    fcvtzu v17.2d, v17.2d
 ; CHECK-SD-NEXT:    fcvtzu v6.2d, v6.2d
+; CHECK-SD-NEXT:    fcvtzu v2.2d, v2.2d
+; CHECK-SD-NEXT:    fcvtzu v19.2d, v19.2d
+; CHECK-SD-NEXT:    fcvtzu v4.2d, v4.2d
 ; CHECK-SD-NEXT:    fcvtzu v18.2d, v18.2d
+; CHECK-SD-NEXT:    fcvtzu v20.2d, v20.2d
 ; CHECK-SD-NEXT:    fcvtzu v5.2d, v5.2d
-; CHECK-SD-NEXT:    fcvtzu v4.2d, v4.2d
 ; CHECK-SD-NEXT:    fcvtzu v3.2d, v3.2d
 ; CHECK-SD-NEXT:    stp q7, q16, [x8, #224]
-; CHECK-SD-NEXT:    fcvtl2 v7.2d, v2.4s
-; CHECK-SD-NEXT:    fcvtzu v16.2d, v19.2d
-; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
-; CHECK-SD-NEXT:    fcvtl v2.2d, v2.2s
-; CHECK-SD-NEXT:    fcvtl2 v5.2d, v0.4s
+; CHECK-SD-NEXT:    fcvtzu v16.2d, v21.2d
 ; CHECK-SD-NEXT:    stp q6, q17, [x8, #192]
-; CHECK-SD-NEXT:    fcvtl2 v6.2d, v1.4s
-; CHECK-SD-NEXT:    fcvtzu v17.2d, v20.2d
+; CHECK-SD-NEXT:    fcvtl2 v17.2d, v1.4s
 ; CHECK-SD-NEXT:    fcvtl v1.2d, v1.2s
+; CHECK-SD-NEXT:    stp q4, q19, [x8, #128]
+; CHECK-SD-NEXT:    stp q3, q20, [x8, #96]
+; CHECK-SD-NEXT:    stp q2, q16, [x8, #64]
+; CHECK-SD-NEXT:    fcvtl2 v16.2d, v0.4s
 ; CHECK-SD-NEXT:    fcvtl v0.2d, v0.2s
-; CHECK-SD-NEXT:    stp q4, q16, [x8, #128]
-; CHECK-SD-NEXT:    fcvtzu v7.2d, v7.2d
-; CHECK-SD-NEXT:    fcvtzu v2.2d, v2.2d
-; CHECK-SD-NEXT:    fcvtzu v4.2d, v6.2d
-; CHECK-SD-NEXT:    stp q3, q17, [x8, #96]
-; CHECK-SD-NEXT:    fcvtzu v3.2d, v5.2d
+; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
+; CHECK-SD-NEXT:    fcvtzu v6.2d, v17.2d
 ; CHECK-SD-NEXT:    fcvtzu v1.2d, v1.2d
+; CHECK-SD-NEXT:    fcvtzu v4.2d, v16.2d
 ; CHECK-SD-NEXT:    fcvtzu v0.2d, v0.2d
-; CHECK-SD-NEXT:    stp q2, q7, [x8, #64]
-; CHECK-SD-NEXT:    stp q0, q3, [x8]
-; CHECK-SD-NEXT:    stp q1, q4, [x8, #32]
+; CHECK-SD-NEXT:    stp q1, q6, [x8, #32]
+; CHECK-SD-NEXT:    stp q0, q4, [x8]
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: fptou_v32f32_v32i64:
@@ -5244,45 +5244,45 @@ define <32 x i64> @fptos_v32f16_v32i64(<32 x half> %a) {
 ; CHECK-GI-FP16-NEXT:    mov v17.d[1], v23.d[0]
 ; CHECK-GI-FP16-NEXT:    mov v1.d[1], v29.d[0]
 ; CHECK-GI-FP16-NEXT:    mov v19.d[1], v30.d[0]
-; CHECK-GI-FP16-NEXT:    mov h21, v3.h[1]
+; CHECK-GI-FP16-NEXT:    mov h16, v3.h[1]
 ; CHECK-GI-FP16-NEXT:    stp q6, q5, [x8, #32]
 ; CHECK-GI-FP16-NEXT:    mov v20.d[1], v22.d[0]
-; CHECK-GI-FP16-NEXT:    mov h16, v3.h[2]
+; CHECK-GI-FP16-NEXT:    mov h21, v3.h[2]
 ; CHECK-GI-FP16-NEXT:    mov h7, v3.h[3]
 ; CHECK-GI-FP16-NEXT:    mov h22, v3.h[4]
-; CHECK-GI-FP16-NEXT:    mov h23, v3.h[5]
-; CHECK-GI-FP16-NEXT:    mov h6, v3.h[6]
+; CHECK-GI-FP16-NEXT:    mov h6, v3.h[5]
+; CHECK-GI-FP16-NEXT:    mov h23, v3.h[6]
 ; CHECK-GI-FP16-NEXT:    mov h5, v3.h[7]
 ; CHECK-GI-FP16-NEXT:    mov v18.d[1], v24.d[0]
 ; CHECK-GI-FP16-NEXT:    mov v2.d[1], v25.d[0]
 ; CHECK-GI-FP16-NEXT:    fcvt d3, h3
-; CHECK-GI-FP16-NEXT:    fcvt d21, h21
-; CHECK-GI-FP16-NEXT:    fcvtzs v0.2d, v0.2d
 ; CHECK-GI-FP16-NEXT:    fcvt d16, h16
+; CHECK-GI-FP16-NEXT:    fcvtzs v0.2d, v0.2d
+; CHECK-GI-FP16-NEXT:    fcvt d21, h21
 ; CHECK-GI-FP16-NEXT:    fcvtzs v4.2d, v4.2d
 ; CHECK-GI-FP16-NEXT:    fcvt d7, h7
 ; CHECK-GI-FP16-NEXT:    fcvt d22, h22
-; CHECK-GI-FP16-NEXT:    fcvt d23, h23
-; CHECK-GI-FP16-NEXT:    fcvtzs v1.2d, v1.2d
 ; CHECK-GI-FP16-NEXT:    fcvt d6, h6
+; CHECK-GI-FP16-NEXT:    fcvtzs v1.2d, v1.2d
+; CHECK-GI-FP16-NEXT:    fcvt d23, h23
 ; CHECK-GI-FP16-NEXT:    fcvt d5, h5
 ; CHECK-GI-FP16-NEXT:    fcvtzs v19.2d, v19.2d
-; CHECK-GI-FP16-NEXT:    mov v3.d[1], v21.d[0]
-; CHECK-GI-FP16-NEXT:    fcvtzs v20.2d, v20.2d
+; CHECK-GI-FP16-NEXT:    mov v3.d[1], v16.d[0]
+; CHECK-GI-FP16-NEXT:    fcvtzs v16.2d, v20.2d
 ; CHECK-GI-FP16-NEXT:    stp q0, q4, [x8, #64]
 ; CHECK-GI-FP16-NEXT:    fcvtzs v0.2d, v17.2d
 ; CHECK-GI-FP16-NEXT:    fcvtzs v4.2d, v18.2d
-; CHECK-GI-FP16-NEXT:    mov v16.d[1], v7.d[0]
-; CHECK-GI-FP16-NEXT:    mov v22.d[1], v23.d[0]
-; CHECK-GI-FP16-NEXT:    mov v6.d[1], v5.d[0]
+; CHECK-GI-FP16-NEXT:    mov v21.d[1], v7.d[0]
+; CHECK-GI-FP16-NEXT:    mov v22.d[1], v6.d[0]
+; CHECK-GI-FP16-NEXT:    mov v23.d[1], v5.d[0]
 ; CHECK-GI-FP16-NEXT:    stp q1, q19, [x8, #96]
 ; CHECK-GI-FP16-NEXT:    fcvtzs v1.2d, v2.2d
 ; CHECK-GI-FP16-NEXT:    fcvtzs v2.2d, v3.2d
-; CHECK-GI-FP16-NEXT:    stp q20, q0, [x8, #128]
-; CHECK-GI-FP16-NEXT:    fcvtzs v0.2d, v16.2d
+; CHECK-GI-FP16-NEXT:    stp q16, q0, [x8, #128]
+; CHECK-GI-FP16-NEXT:    fcvtzs v0.2d, v21.2d
 ; CHECK-GI-FP16-NEXT:    fcvtzs v3.2d, v22.2d
 ; CHECK-GI-FP16-NEXT:    stp q4, q1, [x8, #160]
-; CHECK-GI-FP16-NEXT:    fcvtzs v1.2d, v6.2d
+; CHECK-GI-FP16-NEXT:    fcvtzs v1.2d, v23.2d
 ; CHECK-GI-FP16-NEXT:    stp q2, q0, [x8, #192]
 ; CHECK-GI-FP16-NEXT:    stp q3, q1, [x8, #224]
 ; CHECK-GI-FP16-NEXT:    ret
@@ -5645,45 +5645,45 @@ define <32 x i64> @fptou_v32f16_v32i64(<32 x half> %a) {
 ; CHECK-GI-FP16-NEXT:    mov v17.d[1], v23.d[0]
 ; CHECK-GI-FP16-NEXT:    mov v1.d[1], v29.d[0]
 ; CHECK-GI-FP16-NEXT:    mov v19.d[1], v30.d[0]
-; CHECK-GI-FP16-NEXT:    mov h21, v3.h[1]
+; CHECK-GI-FP16-NEXT:    mov h16, v3.h[1]
 ; CHECK-GI-FP16-NEXT:    stp q6, q5, [x8, #32]
 ; CHECK-GI-FP16-NEXT:    mov v20.d[1], v22.d[0]
-; CHECK-GI-FP16-NEXT:    mov h16, v3.h[2]
+; CHECK-GI-FP16-NEXT:    mov h21, v3.h[2]
 ; CHECK-GI-FP16-NEXT:    mov h7, v3.h[3]
 ; CHECK-GI-FP16-NEXT:    mov h22, v3.h[4]
-; CHECK-GI-FP16-NEXT:    mov h23, v3.h[5]
-; CHECK-GI-FP16-NEXT:    mov h6, v3.h[6]
+; CHECK-GI-FP16-NEXT:    mov h6, v3.h[5]
+; CHECK-GI-FP16-NEXT:    mov h23, v3.h[6]
 ; CHECK-GI-FP16-NEXT:    mov h5, v3.h[7]
 ; CHECK-GI-FP16-NEXT:    mov v18.d[1], v24.d[0]
 ; CHECK-GI-FP16-NEXT:    mov v2.d[1], v25.d[0]
 ; CHECK-GI-FP16-NEXT:    fcvt d3, h3
-; CHECK-GI-FP16-NEXT:    fcvt d21, h21
-; CHECK-GI-FP16-NEXT:    fcvtzu v0.2d, v0.2d
 ; CHECK-GI-FP16-NEXT:    fcvt d16, h16
+; CHECK-GI-FP16-NEXT:    fcvtzu v0.2d, v0.2d
+; CHECK-GI-FP16-NEXT:    fcvt d21, h21
 ; CHECK-GI-FP16-NEXT:    fcvtzu v4.2d, v4.2d
 ; CHECK-GI-FP16-NEXT:    fcvt d7, h7
 ; CHECK-GI-FP16-NEXT:    fcvt d22, h22
-; CHECK-GI-FP16-NEXT:    fcvt d23, h23
-; CHECK-GI-FP16-NEXT:    fcvtzu v1.2d, v1.2d
 ; CHECK-GI-FP16-NEXT:    fcvt d6, h6
+; CHECK-GI-FP16-NEXT:    fcvtzu v1.2d, v1.2d
+; CHECK-GI-FP16-NEXT:    fcvt d23, h23
 ; CHECK-GI-FP16-NEXT:    fcvt d5, h5
 ; CHECK-GI-FP16-NEXT:    fcvtzu v19.2d, v19.2d
-; CHECK-GI-FP16-NEXT:    mov v3.d[1], v21.d[0]
-; CHECK-GI-FP16-NEXT:    fcvtzu v20.2d, v20.2d
+; CHECK-GI-FP16-NEXT:    mov v3.d[1], v16.d[0]
+; CHECK-GI-FP16-NEXT:    fcvtzu v16.2d, v20.2d
 ; CHECK-GI-FP16-NEXT:    stp q0, q4, [x8, #64]
 ; CHECK-GI-FP16-NEXT:    fcvtzu v0.2d, v17.2d
 ; CHECK-GI-FP16-NEXT:    fcvtzu v4.2d, v18.2d
-; CHECK-GI-FP16-NEXT:    mov v16.d[1], v7.d[0]
-; CHECK-GI-FP16-NEXT:    mov v22.d[1], v23.d[0]
-; CHECK-GI-FP16-NEXT:    mov v6.d[1], v5.d[0]
+; CHECK-GI-FP16-NEXT:    mov v21.d[1], v7.d[0]
+; CHECK-GI-FP16-NEXT:    mov v22.d[1], v6.d[0]
+; CHECK-GI-FP16-NEXT:    mov v23.d[1], v5.d[0]
 ; CHECK-GI-FP16-NEXT:    stp q1, q19, [x8, #96]
 ; CHECK-GI-FP16-NEXT:    fcvtzu v1.2d, v2.2d
 ; CHECK-GI-FP16-NEXT:    fcvtzu v2.2d, v3.2d
-; CHECK-GI-FP16-NEXT:    stp q20, q0, [x8, #128]
-; CHECK-GI-FP16-NEXT:    fcvtzu v0.2d, v16.2d
+; CHECK-GI-FP16-NEXT:    stp q16, q0, [x8, #128]
+; CHECK-GI-FP16-NEXT:    fcvtzu v0.2d, v21.2d
 ; CHECK-GI-FP16-NEXT:    fcvtzu v3.2d, v22.2d
 ; CHECK-GI-FP16-NEXT:    stp q4, q1, [x8, #160]
-; CHECK-GI-FP16-NEXT:    fcvtzu v1.2d, v6.2d
+; CHECK-GI-FP16-NEXT:    fcvtzu v1.2d, v23.2d
 ; CHECK-GI-FP16-NEXT:    stp q2, q0, [x8, #192]
 ; CHECK-GI-FP16-NEXT:    stp q3, q1, [x8, #224]
 ; CHECK-GI-FP16-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll b/llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll
index a01644678b25f..6f6031b90b31d 100644
--- a/llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll
+++ b/llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll
@@ -3520,31 +3520,31 @@ define <8 x i100> @test_unsigned_v8f16_v8i100(<8 x half> %f) {
 ; CHECK-SD-NEXT:    fmov s0, s8
 ; CHECK-SD-NEXT:    bl __fixunssfti
 ; CHECK-SD-NEXT:    extr x8, x21, x27, #28
-; CHECK-SD-NEXT:    extr x9, x29, x20, #28
+; CHECK-SD-NEXT:    str x24, [x19]
+; CHECK-SD-NEXT:    bfi x22, x20, #36, #28
 ; CHECK-SD-NEXT:    stur x28, [x19, #75]
+; CHECK-SD-NEXT:    extr x9, x29, x20, #28
 ; CHECK-SD-NEXT:    fcmp s8, #0.0
-; CHECK-SD-NEXT:    bfi x22, x20, #36, #28
-; CHECK-SD-NEXT:    lsr x11, x29, #28
 ; CHECK-SD-NEXT:    stur x8, [x19, #41]
-; CHECK-SD-NEXT:    str x9, [x19, #16]
-; CHECK-SD-NEXT:    ldr x10, [sp, #32] // 8-byte Folded Reload
+; CHECK-SD-NEXT:    ldr x11, [sp, #32] // 8-byte Folded Reload
+; CHECK-SD-NEXT:    stp x22, x9, [x19, #8]
+; CHECK-SD-NEXT:    lsr x9, x29, #28
 ; CHECK-SD-NEXT:    csel x8, xzr, x0, lt
-; CHECK-SD-NEXT:    csel x9, xzr, x1, lt
+; CHECK-SD-NEXT:    csel x10, xzr, x1, lt
 ; CHECK-SD-NEXT:    fcmp s8, s9
-; CHECK-SD-NEXT:    stp x24, x22, [x19]
-; CHECK-SD-NEXT:    stur x10, [x19, #50]
-; CHECK-SD-NEXT:    lsr x10, x21, #28
-; CHECK-SD-NEXT:    strb w11, [x19, #24]
-; CHECK-SD-NEXT:    strb w10, [x19, #49]
-; CHECK-SD-NEXT:    csel x9, x23, x9, gt
+; CHECK-SD-NEXT:    stur x11, [x19, #50]
+; CHECK-SD-NEXT:    lsr x11, x21, #28
+; CHECK-SD-NEXT:    strb w9, [x19, #24]
+; CHECK-SD-NEXT:    strb w11, [x19, #49]
+; CHECK-SD-NEXT:    csel x10, x23, x10, gt
 ; CHECK-SD-NEXT:    csinv x8, x8, xzr, le
 ; CHECK-SD-NEXT:    ldp x12, x11, [sp] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    bfi x9, x27, #36, #28
+; CHECK-SD-NEXT:    bfi x10, x27, #36, #28
 ; CHECK-SD-NEXT:    stur x8, [x19, #25]
-; CHECK-SD-NEXT:    stur x9, [x19, #33]
-; CHECK-SD-NEXT:    extr x10, x11, x12, #28
+; CHECK-SD-NEXT:    stur x10, [x19, #33]
+; CHECK-SD-NEXT:    extr x9, x11, x12, #28
 ; CHECK-SD-NEXT:    bfi x26, x12, #36, #28
-; CHECK-SD-NEXT:    stur x10, [x19, #91]
+; CHECK-SD-NEXT:    stur x9, [x19, #91]
 ; CHECK-SD-NEXT:    ldp x10, x9, [sp, #16] // 16-byte Folded Reload
 ; CHECK-SD-NEXT:    stur x26, [x19, #83]
 ; CHECK-SD-NEXT:    extr x8, x9, x10, #28
diff --git a/llvm/test/CodeGen/AArch64/itofp.ll b/llvm/test/CodeGen/AArch64/itofp.ll
index 07957c117868d..864941aa31c7b 100644
--- a/llvm/test/CodeGen/AArch64/itofp.ll
+++ b/llvm/test/CodeGen/AArch64/itofp.ll
@@ -2243,46 +2243,46 @@ entry:
 define <32 x double> @stofp_v32i32_v32f64(<32 x i32> %a) {
 ; CHECK-SD-LABEL: stofp_v32i32_v32f64:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    sshll2 v16.2d, v7.4s, #0
-; CHECK-SD-NEXT:    sshll v7.2d, v7.2s, #0
 ; CHECK-SD-NEXT:    sshll2 v17.2d, v6.4s, #0
 ; CHECK-SD-NEXT:    sshll v6.2d, v6.2s, #0
-; CHECK-SD-NEXT:    sshll2 v19.2d, v4.4s, #0
+; CHECK-SD-NEXT:    sshll2 v16.2d, v7.4s, #0
+; CHECK-SD-NEXT:    sshll v19.2d, v3.2s, #0
+; CHECK-SD-NEXT:    sshll v7.2d, v7.2s, #0
+; CHECK-SD-NEXT:    sshll2 v3.2d, v3.4s, #0
+; CHECK-SD-NEXT:    sshll2 v20.2d, v4.4s, #0
 ; CHECK-SD-NEXT:    sshll v4.2d, v4.2s, #0
 ; CHECK-SD-NEXT:    sshll2 v18.2d, v5.4s, #0
-; CHECK-SD-NEXT:    sshll v5.2d, v5.2s, #0
-; CHECK-SD-NEXT:    scvtf v16.2d, v16.2d
-; CHECK-SD-NEXT:    scvtf v7.2d, v7.2d
 ; CHECK-SD-NEXT:    scvtf v17.2d, v17.2d
 ; CHECK-SD-NEXT:    scvtf v6.2d, v6.2d
+; CHECK-SD-NEXT:    scvtf v16.2d, v16.2d
+; CHECK-SD-NEXT:    scvtf v7.2d, v7.2d
+; CHECK-SD-NEXT:    scvtf v3.2d, v3.2d
+; CHECK-SD-NEXT:    sshll2 v21.2d, v2.4s, #0
+; CHECK-SD-NEXT:    scvtf v20.2d, v20.2d
 ; CHECK-SD-NEXT:    scvtf v4.2d, v4.2d
+; CHECK-SD-NEXT:    sshll v5.2d, v5.2s, #0
+; CHECK-SD-NEXT:    sshll v2.2d, v2.2s, #0
 ; CHECK-SD-NEXT:    scvtf v18.2d, v18.2d
-; CHECK-SD-NEXT:    scvtf v5.2d, v5.2d
-; CHECK-SD-NEXT:    stp q7, q16, [x8, #224]
-; CHECK-SD-NEXT:    sshll2 v16.2d, v3.4s, #0
-; CHECK-SD-NEXT:    sshll v3.2d, v3.2s, #0
-; CHECK-SD-NEXT:    scvtf v7.2d, v19.2d
 ; CHECK-SD-NEXT:    stp q6, q17, [x8, #192]
-; CHECK-SD-NEXT:    sshll2 v17.2d, v2.4s, #0
-; CHECK-SD-NEXT:    sshll v2.2d, v2.2s, #0
-; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
-; CHECK-SD-NEXT:    scvtf v6.2d, v16.2d
-; CHECK-SD-NEXT:    scvtf v3.2d, v3.2d
-; CHECK-SD-NEXT:    sshll2 v16.2d, v1.4s, #0
-; CHECK-SD-NEXT:    sshll v1.2d, v1.2s, #0
-; CHECK-SD-NEXT:    scvtf v5.2d, v17.2d
-; CHECK-SD-NEXT:    stp q4, q7, [x8, #128]
+; CHECK-SD-NEXT:    scvtf v17.2d, v19.2d
+; CHECK-SD-NEXT:    stp q7, q16, [x8, #224]
 ; CHECK-SD-NEXT:    sshll2 v7.2d, v0.4s, #0
 ; CHECK-SD-NEXT:    sshll v0.2d, v0.2s, #0
+; CHECK-SD-NEXT:    stp q4, q20, [x8, #128]
+; CHECK-SD-NEXT:    scvtf v16.2d, v21.2d
+; CHECK-SD-NEXT:    scvtf v5.2d, v5.2d
 ; CHECK-SD-NEXT:    scvtf v2.2d, v2.2d
-; CHECK-SD-NEXT:    scvtf v4.2d, v16.2d
-; CHECK-SD-NEXT:    stp q3, q6, [x8, #96]
-; CHECK-SD-NEXT:    scvtf v1.2d, v1.2d
-; CHECK-SD-NEXT:    scvtf v3.2d, v7.2d
+; CHECK-SD-NEXT:    stp q17, q3, [x8, #96]
+; CHECK-SD-NEXT:    sshll2 v3.2d, v1.4s, #0
+; CHECK-SD-NEXT:    sshll v1.2d, v1.2s, #0
+; CHECK-SD-NEXT:    scvtf v4.2d, v7.2d
 ; CHECK-SD-NEXT:    scvtf v0.2d, v0.2d
-; CHECK-SD-NEXT:    stp q2, q5, [x8, #64]
-; CHECK-SD-NEXT:    stp q1, q4, [x8, #32]
-; CHECK-SD-NEXT:    stp q0, q3, [x8]
+; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
+; CHECK-SD-NEXT:    stp q2, q16, [x8, #64]
+; CHECK-SD-NEXT:    scvtf v3.2d, v3.2d
+; CHECK-SD-NEXT:    scvtf v1.2d, v1.2d
+; CHECK-SD-NEXT:    stp q0, q4, [x8]
+; CHECK-SD-NEXT:    stp q1, q3, [x8, #32]
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: stofp_v32i32_v32f64:
@@ -2336,46 +2336,46 @@ entry:
 define <32 x double> @utofp_v32i32_v32f64(<32 x i32> %a) {
 ; CHECK-SD-LABEL: utofp_v32i32_v32f64:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    ushll2 v16.2d, v7.4s, #0
-; CHECK-SD-NEXT:    ushll v7.2d, v7.2s, #0
 ; CHECK-SD-NEXT:    ushll2 v17.2d, v6.4s, #0
 ; CHECK-SD-NEXT:    ushll v6.2d, v6.2s, #0
-; CHECK-SD-NEXT:    ushll2 v19.2d, v4.4s, #0
+; CHECK-SD-NEXT:    ushll2 v16.2d, v7.4s, #0
+; CHECK-SD-NEXT:    ushll v19.2d, v3.2s, #0
+; CHECK-SD-NEXT:    ushll v7.2d, v7.2s, #0
+; CHECK-SD-NEXT:    ushll2 v3.2d, v3.4s, #0
+; CHECK-SD-NEXT:    ushll2 v20.2d, v4.4s, #0
 ; CHECK-SD-NEXT:    ushll v4.2d, v4.2s, #0
 ; CHECK-SD-NEXT:    ushll2 v18.2d, v5.4s, #0
-; CHECK-SD-NEXT:    ushll v5.2d, v5.2s, #0
-; CHECK-SD-NEXT:    ucvtf v16.2d, v16.2d
-; CHECK-SD-NEXT:    ucvtf v7.2d, v7.2d
 ; CHECK-SD-NEXT:    ucvtf v17.2d, v17.2d
 ; CHECK-SD-NEXT:    ucvtf v6.2d, v6.2d
+; CHECK-SD-NEXT:    ucvtf v16.2d, v16.2d
+; CHECK-SD-NEXT:    ucvtf v7.2d, v7.2d
+; CHECK-SD-NEXT:    ucvtf v3.2d, v3.2d
+; CHECK-SD-NEXT:    ushll2 v21.2d, v2.4s, #0
+; CHECK-SD-NEXT:    ucvtf v20.2d, v20.2d
 ; CHECK-SD-NEXT:    ucvtf v4.2d, v4.2d
+; CHECK-SD-NEXT:    ushll v5.2d, v5.2s, #0
+; CHECK-SD-NEXT:    ushll v2.2d, v2.2s, #0
 ; CHECK-SD-NEXT:    ucvtf v18.2d, v18.2d
-; CHECK-SD-NEXT:    ucvtf v5.2d, v5.2d
-; CHECK-SD-NEXT:    stp q7, q16, [x8, #224]
-; CHECK-SD-NEXT:    ushll2 v16.2d, v3.4s, #0
-; CHECK-SD-NEXT:    ushll v3.2d, v3.2s, #0
-; CHECK-SD-NEXT:    ucvtf v7.2d, v19.2d
 ; CHECK-SD-NEXT:    stp q6, q17, [x8, #192]
-; CHECK-SD-NEXT:    ushll2 v17.2d, v2.4s, #0
-; CHECK-SD-NEXT:    ushll v2.2d, v2.2s, #0
-; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
-; CHECK-SD-NEXT:    ucvtf v6.2d, v16.2d
-; CHECK-SD-NEXT:    ucvtf v3.2d, v3.2d
-; CHECK-SD-NEXT:    ushll2 v16.2d, v1.4s, #0
-; CHECK-SD-NEXT:    ushll v1.2d, v1.2s, #0
-; CHECK-SD-NEXT:    ucvtf v5.2d, v17.2d
-; CHECK-SD-NEXT:    stp q4, q7, [x8, #128]
+; CHECK-SD-NEXT:    ucvtf v17.2d, v19.2d
+; CHECK-SD-NEXT:    stp q7, q16, [x8, #224]
 ; CHECK-SD-NEXT:    ushll2 v7.2d, v0.4s, #0
 ; CHECK-SD-NEXT:    ushll v0.2d, v0.2s, #0
+; CHECK-SD-NEXT:    stp q4, q20, [x8, #128]
+; CHECK-SD-NEXT:    ucvtf v16.2d, v21.2d
+; CHECK-SD-NEXT:    ucvtf v5.2d, v5.2d
 ; CHECK-SD-NEXT:    ucvtf v2.2d, v2.2d
-; CHECK-SD-NEXT:    ucvtf v4.2d, v16.2d
-; CHECK-SD-NEXT:    stp q3, q6, [x8, #96]
-; CHECK-SD-NEXT:    ucvtf v1.2d, v1.2d
-; CHECK-SD-NEXT:    ucvtf v3.2d, v7.2d
+; CHECK-SD-NEXT:    stp q17, q3, [x8, #96]
+; CHECK-SD-NEXT:    ushll2 v3.2d, v1.4s, #0
+; CHECK-SD-NEXT:    ushll v1.2d, v1.2s, #0
+; CHECK-SD-NEXT:    ucvtf v4.2d, v7.2d
 ; CHECK-SD-NEXT:    ucvtf v0.2d, v0.2d
-; CHECK-SD-NEXT:    stp q2, q5, [x8, #64]
-; CHECK-SD-NEXT:    stp q1, q4, [x8, #32]
-; CHECK-SD-NEXT:    stp q0, q3, [x8]
+; CHECK-SD-NEXT:    stp q5, q18, [x8, #160]
+; CHECK-SD-NEXT:    stp q2, q16, [x8, #64]
+; CHECK-SD-NEXT:    ucvtf v3.2d, v3.2d
+; CHECK-SD-NEXT:    ucvtf v1.2d, v1.2d
+; CHECK-SD-NEXT:    stp q0, q4, [x8]
+; CHECK-SD-NEXT:    stp q1, q3, [x8, #32]
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: utofp_v32i32_v32f64:
@@ -2863,7 +2863,7 @@ define <32 x double> @stofp_v32i16_v32f64(<32 x i16> %a) {
 ; CHECK-SD:       // %bb.0: // %entry
 ; CHECK-SD-NEXT:    sshll2 v4.4s, v3.8h, #0
 ; CHECK-SD-NEXT:    sshll2 v5.4s, v2.8h, #0
-; CHECK-SD-NEXT:    sshll2 v7.4s, v1.8h, #0
+; CHECK-SD-NEXT:    sshll2 v16.4s, v1.8h, #0
 ; CHECK-SD-NEXT:    sshll2 v17.4s, v0.8h, #0
 ; CHECK-SD-NEXT:    sshll v3.4s, v3.4h, #0
 ; CHECK-SD-NEXT:    sshll v1.4s, v1.4h, #0
@@ -2871,43 +2871,43 @@ define <32 x double> @stofp_v32i16_v32f64(<32 x i16> %a) {
 ; CHECK-SD-NEXT:    sshll v0.4s, v0.4h, #0
 ; CHECK-SD-NEXT:    sshll2 v6.2d, v4.4s, #0
 ; CHECK-SD-NEXT:    sshll v4.2d, v4.2s, #0
-; CHECK-SD-NEXT:    sshll2 v16.2d, v5.4s, #0
+; CHECK-SD-NEXT:    sshll2 v7.2d, v5.4s, #0
 ; CHECK-SD-NEXT:    sshll v5.2d, v5.2s, #0
-; CHECK-SD-NEXT:    sshll2 v18.2d, v7.4s, #0
-; CHECK-SD-NEXT:    sshll v7.2d, v7.2s, #0
+; CHECK-SD-NEXT:    sshll2 v18.2d, v16.4s, #0
+; CHECK-SD-NEXT:    sshll v16.2d, v16.2s, #0
 ; CHECK-SD-NEXT:    sshll2 v19.2d, v17.4s, #0
 ; CHECK-SD-NEXT:    scvtf v6.2d, v6.2d
 ; CHECK-SD-NEXT:    scvtf v4.2d, v4.2d
-; CHECK-SD-NEXT:    scvtf v16.2d, v16.2d
-; CHECK-SD-NEXT:    scvtf v5.2d, v5.2d
 ; CHECK-SD-NEXT:    scvtf v7.2d, v7.2d
+; CHECK-SD-NEXT:    scvtf v5.2d, v5.2d
+; CHECK-SD-NEXT:    scvtf v16.2d, v16.2d
 ; CHECK-SD-NEXT:    stp q4, q6, [x8, #224]
 ; CHECK-SD-NEXT:    sshll v6.2d, v17.2s, #0
 ; CHECK-SD-NEXT:    scvtf v17.2d, v18.2d
-; CHECK-SD-NEXT:    sshll2 v4.2d, v3.4s, #0
-; CHECK-SD-NEXT:    stp q5, q16, [x8, #160]
+; CHECK-SD-NEXT:    stp q5, q7, [x8, #160]
+; CHECK-SD-NEXT:    sshll2 v7.2d, v3.4s, #0
 ; CHECK-SD-NEXT:    sshll v3.2d, v3.2s, #0
-; CHECK-SD-NEXT:    scvtf v16.2d, v19.2d
+; CHECK-SD-NEXT:    scvtf v4.2d, v19.2d
 ; CHECK-SD-NEXT:    scvtf v5.2d, v6.2d
 ; CHECK-SD-NEXT:    sshll2 v6.2d, v2.4s, #0
 ; CHECK-SD-NEXT:    sshll v2.2d, v2.2s, #0
-; CHECK-SD-NEXT:    scvtf v4.2d, v4.2d
+; CHECK-SD-NEXT:    scvtf v7.2d, v7.2d
 ; CHECK-SD-NEXT:    scvtf v3.2d, v3.2d
-; CHECK-SD-NEXT:    stp q7, q17, [x8, #96]
-; CHECK-SD-NEXT:    sshll2 v7.2d, v1.4s, #0
+; CHECK-SD-NEXT:    stp q16, q17, [x8, #96]
+; CHECK-SD-NEXT:    sshll2 v16.2d, v1.4s, #0
 ; CHECK-SD-NEXT:    sshll v1.2d, v1.2s, #0
 ; CHECK-SD-NEXT:    scvtf v6.2d, v6.2d
 ; CHECK-SD-NEXT:    scvtf v2.2d, v2.2d
-; CHECK-SD-NEXT:    stp q5, q16, [x8, #32]
-; CHECK-SD-NEXT:    sshll2 v5.2d, v0.4s, #0
+; CHECK-SD-NEXT:    stp q5, q4, [x8, #32]
+; CHECK-SD-NEXT:    sshll2 v4.2d, v0.4s, #0
 ; CHECK-SD-NEXT:    sshll v0.2d, v0.2s, #0
-; CHECK-SD-NEXT:    scvtf v7.2d, v7.2d
-; CHECK-SD-NEXT:    stp q3, q4, [x8, #192]
+; CHECK-SD-NEXT:    scvtf v5.2d, v16.2d
+; CHECK-SD-NEXT:    stp q3, q7, [x8, #192]
 ; CHECK-SD-NEXT:    scvtf v1.2d, v1.2d
-; CHECK-SD-NEXT:    scvtf v3.2d, v5.2d
+; CHECK-SD-NEXT:    scvtf v3.2d, v4.2d
 ; CHECK-SD-NEXT:    scvtf v0.2d, v0.2d
 ; CHECK-SD-NEXT:    stp q2, q6, [x8, #128]
-; CHECK-SD-NEXT:    stp q1, q7, [x8, #64]
+; CHECK-SD-NEXT:    stp q1, q5, [x8, #64]
 ; CHECK-SD-NEXT:    stp q0, q3, [x8]
 ; CHECK-SD-NEXT:    ret
 ;
@@ -2972,7 +2972,7 @@ define <32 x double> @utofp_v32i16_v32f64(<32 x i16> %a) {
 ; CHECK-SD:       // %bb.0: // %entry
 ; CHECK-SD-NEXT:    ushll2 v4.4s, v3.8h, #0
 ; CHECK-SD-NEXT:    ushll2 v5.4s, v2.8h, #0
-; CHECK-SD-NEXT:    ushll2 v7.4s, v1.8h, #0
+; CHECK-SD-NEXT:    ushll2 v16.4s, v1.8h, #0
 ; CHECK-SD-NEXT:    ushll2 v17.4s, v0.8h, #0
 ; CHECK-SD-NEXT:    ushll v3.4s, v3.4h, #0
 ; CHECK-SD-NEXT:    ushll v1.4s, v1.4h, #0
@@ -2980,43 +2980,43 @@ define <32 x double> @utofp_v32i16_v32f64(<32 x i16> %a) {
 ; CHECK-SD-NEXT:    ushll v0.4s, v0.4h, #0
 ; CHECK-SD-NEXT:    ushll2 v6.2d, v4.4s, #0
 ; CHECK-SD-NEXT:    ushll v4.2d, v4.2s, #0
-; CHECK-SD-NEXT:    ushll2 v16.2d, v5.4s, #0
+; CHECK-SD-NEXT:    ushll2 v7.2d, v5.4s, #0
 ; CHECK-SD-NEXT:    ushll v5.2d, v5.2s, #0
-; CHECK-SD-NEXT:    ushll2 v18.2d, v7.4s, #0
-; CHECK-SD-NEXT:    ushll v7.2d, v7.2s, #0
+; CHECK-SD-NEXT:    ushll2 v18.2d, v16.4s, #0
+; CHECK-SD-NEXT:    ushll v16.2d, v16.2s, #0
 ; CHECK-SD-NEXT:    ushll2 v19.2d, v17.4s, #0
 ; CHECK-SD-NEXT:    ucvtf v6.2d, v6.2d
 ; CHECK-SD-NEXT:    ucvtf v4.2d, v4.2d
-; CHECK-SD-NEXT:    ucvtf v16.2d, v16.2d
-; CHECK-SD-NEXT:    ucvtf v5.2d, v5.2d
 ; CHECK-SD-NEXT:    ucvtf v7.2d, v7.2d
+; CHECK-SD-NEXT:    ucvtf v5.2d, v5.2d
+; CHECK-SD-NEXT:    ucvtf v16.2d, v16.2d
 ; CHECK-SD-NEXT:    stp q4, q6, [x8, #224]
 ; CHECK-SD-NEXT:    ushll v6.2d, v17.2s, #0
 ; CHECK-SD-NEXT:    ucvtf v17.2d, v18.2d
-; CHECK-SD-NEXT:    ushll2 v4.2d, v3.4s, #0
-; CHECK-SD-NEXT:    stp q5, q16, [x8, #160]
+; CHECK-SD-NEXT:    stp q5, q7, [x8, #160]
+; CHECK-SD-NEXT:    ushll2 v7.2d, v3.4s, #0
 ; CHECK-SD-NEXT:    ushll v3.2d, v3.2s, #0
-; CHECK-SD-NEXT:    ucvtf v16.2d, v19.2d
+; CHECK-SD-NEXT:    ucvtf v4.2d, v19.2d
 ; CHECK-SD-NEXT:    ucvtf v5.2d, v6.2d
 ; CHECK-SD-NEXT:    ushll2 v6.2d, v2.4s, #0
 ; CHECK-SD-NEXT:    ushll v2.2d, v2.2s, #0
-; CHECK-SD-NEXT:    ucvtf v4.2d, v4.2d
+; CHECK-SD-NEXT:    ucvtf v7.2d, v7.2d
 ; CHECK-SD-NEXT:    ucvtf v3.2d, v3.2d
-; CHECK-SD-NEXT:    stp q7, q17, [x8, #96]
-; CHECK-SD-NEXT:    ushll2 v7.2d, v1.4s, #0
+; CHECK-SD-NEXT:    stp q16, q17, [x8, #96]
+; CHECK-SD-NEXT:    ushll2 v16.2d, v1.4s, #0
 ; CHECK-SD-NEXT:    ushll v1.2d, v1.2s, #0
 ; CHECK-SD-NEXT:    ucvtf v6.2d, v6.2d
 ; CHECK-SD-NEXT:    ucvtf v2.2d, v2.2d
-; CHECK-SD-NEXT:    stp q5, q16, [x8, #32]
-; CHECK-SD-NEXT:    ushll2 v5.2d, v0.4s, #0
+; CHECK-SD-NEXT:    stp q5, q4, [x8, #32]
+; CHECK-SD-NEXT:    ushll2 v4.2d, v0.4s, #0
 ; CHECK-SD-NEXT:    ushll v0.2d, v0.2s, #0
-; CHECK-SD-NEXT:    ucvtf v7.2d, v7.2d
-; CHECK-SD-NEXT:    stp q3, q4, [x8, #192]
+; CHECK-SD-NEXT:    ucvtf v5.2d, v16.2d
+; CHECK-SD-NEXT:    stp q3, q7, [x8, #192]
 ; CHECK-SD-NEXT:    ucvtf v1.2d, v1.2d
-; CHECK-SD-NEXT:    ucvtf v3.2d, v5.2d
+; CHECK-SD-NEXT:    ucvtf v3.2d, v4.2d
 ; CHECK-SD-NEXT:    ucvtf v0.2d, v0.2d
 ; CHECK-SD-NEXT:    stp q2, q6, [x8, #128]
-; CHECK-SD-NEXT:    stp q1, q7, [x8, #64]
+; CHECK-SD-NEXT:    stp q1, q5, [x8, #64]
 ; CHECK-SD-NEXT:    stp q0, q3, [x8]
 ; CHECK-SD-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AArch64/mul.ll b/llvm/test/CodeGen/AArch64/mul.ll
index 8d9a6e6b92914..39ce70f88d6ba 100644
--- a/llvm/test/CodeGen/AArch64/mul.ll
+++ b/llvm/test/CodeGen/AArch64/mul.ll
@@ -576,17 +576,17 @@ define <3 x i128> @v3i128(<3 x i128> %d, <3 x i128> %e) {
 ; CHECK-GI-NEXT:    mul x9, x2, x10
 ; CHECK-GI-NEXT:    umulh x14, x2, x10
 ; CHECK-GI-NEXT:    madd x10, x3, x10, x13
-; CHECK-GI-NEXT:    ldp x13, x15, [sp, #16]
+; CHECK-GI-NEXT:    ldp x15, x13, [sp, #16]
 ; CHECK-GI-NEXT:    mov x2, x9
 ; CHECK-GI-NEXT:    umulh x11, x0, x6
 ; CHECK-GI-NEXT:    mov x0, x8
-; CHECK-GI-NEXT:    mul x15, x4, x15
+; CHECK-GI-NEXT:    mul x13, x4, x13
 ; CHECK-GI-NEXT:    add x3, x10, x14
-; CHECK-GI-NEXT:    umulh x16, x4, x13
+; CHECK-GI-NEXT:    umulh x16, x4, x15
 ; CHECK-GI-NEXT:    add x1, x12, x11
-; CHECK-GI-NEXT:    madd x15, x5, x13, x15
-; CHECK-GI-NEXT:    mul x4, x4, x13
-; CHECK-GI-NEXT:    add x5, x15, x16
+; CHECK-GI-NEXT:    madd x13, x5, x15, x13
+; CHECK-GI-NEXT:    mul x4, x4, x15
+; CHECK-GI-NEXT:    add x5, x13, x16
 ; CHECK-GI-NEXT:    ret
 entry:
   %s = mul <3 x i128> %d, %e
@@ -638,14 +638,14 @@ define <4 x i128> @v4i128(<4 x i128> %d, <4 x i128> %e) {
 ; CHECK-GI-NEXT:    umulh x17, x4, x15
 ; CHECK-GI-NEXT:    add x3, x13, x14
 ; CHECK-GI-NEXT:    madd x15, x5, x15, x16
-; CHECK-GI-NEXT:    ldp x16, x18, [sp, #48]
+; CHECK-GI-NEXT:    ldp x18, x16, [sp, #48]
 ; CHECK-GI-NEXT:    mov x4, x10
-; CHECK-GI-NEXT:    mul x18, x6, x18
-; CHECK-GI-NEXT:    umulh x0, x6, x16
+; CHECK-GI-NEXT:    mul x16, x6, x16
+; CHECK-GI-NEXT:    umulh x0, x6, x18
 ; CHECK-GI-NEXT:    add x5, x15, x17
-; CHECK-GI-NEXT:    madd x18, x7, x16, x18
-; CHECK-GI-NEXT:    mul x6, x6, x16
-; CHECK-GI-NEXT:    add x7, x18, x0
+; CHECK-GI-NEXT:    madd x16, x7, x18, x16
+; CHECK-GI-NEXT:    mul x6, x6, x18
+; CHECK-GI-NEXT:    add x7, x16, x0
 ; CHECK-GI-NEXT:    mov x0, x8
 ; CHECK-GI-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/nontemporal-load.ll b/llvm/test/CodeGen/AArch64/nontemporal-load.ll
index adb209c0c6348..ffafe69b29266 100644
--- a/llvm/test/CodeGen/AArch64/nontemporal-load.ll
+++ b/llvm/test/CodeGen/AArch64/nontemporal-load.ll
@@ -472,16 +472,17 @@ define <33 x i8> @test_ldnp_v33i8(ptr %A) {
 define <4 x i65> @test_ldnp_v4i65(ptr %A) {
 ; CHECK-LABEL: test_ldnp_v4i65:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    ldp x8, x9, [x0, #16]
+; CHECK-NEXT:    ldp x8, x9, [x0, #8]
+; CHECK-NEXT:    ldr x10, [x0, #24]
 ; CHECK-NEXT:    ldrb w11, [x0, #32]
-; CHECK-NEXT:    ldp x0, x10, [x0]
+; CHECK-NEXT:    ldr x0, [x0]
+; CHECK-NEXT:    ubfx x5, x10, #2, #1
+; CHECK-NEXT:    extr x2, x9, x8, #1
+; CHECK-NEXT:    extr x4, x10, x9, #2
+; CHECK-NEXT:    extr x6, x11, x10, #3
+; CHECK-NEXT:    ubfx x3, x9, #1, #1
 ; CHECK-NEXT:    ubfx x7, x11, #3, #1
-; CHECK-NEXT:    extr x4, x9, x8, #2
-; CHECK-NEXT:    extr x6, x11, x9, #3
-; CHECK-NEXT:    ubfx x3, x8, #1, #1
-; CHECK-NEXT:    extr x2, x8, x10, #1
-; CHECK-NEXT:    ubfx x5, x9, #2, #1
-; CHECK-NEXT:    and x1, x10, #0x1
+; CHECK-NEXT:    and x1, x8, #0x1
 ; CHECK-NEXT:    ret
 ;
 ; CHECK-BE-LABEL: test_ldnp_v4i65:
diff --git a/llvm/test/CodeGen/AArch64/nzcv-save.ll b/llvm/test/CodeGen/AArch64/nzcv-save.ll
index c40e529ccab1b..cc666dd8d34e6 100644
--- a/llvm/test/CodeGen/AArch64/nzcv-save.ll
+++ b/llvm/test/CodeGen/AArch64/nzcv-save.ll
@@ -6,19 +6,19 @@
 define void @f(ptr nocapture %a, ptr nocapture %b, ptr nocapture %cc, ptr nocapture %dd) nounwind uwtable noinline ssp {
 ; CHECK-LABEL: f:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    ldp x8, x10, [x2]
-; CHECK-NEXT:    ldp x9, x11, [x3]
+; CHECK-NEXT:    ldp x8, x11, [x3]
+; CHECK-NEXT:    ldp x9, x10, [x2]
 ; CHECK-NEXT:    ldp x13, x12, [x2, #16]
-; CHECK-NEXT:    adds x8, x8, x9
-; CHECK-NEXT:    ldp x14, x9, [x3, #16]
+; CHECK-NEXT:    adds x8, x9, x8
+; CHECK-NEXT:    ldp x9, x14, [x3, #16]
 ; CHECK-NEXT:    adcs x10, x10, x11
 ; CHECK-NEXT:    stp x8, x10, [x0]
-; CHECK-NEXT:    adcs x11, x13, x14
-; CHECK-NEXT:    adc x13, x12, x9
+; CHECK-NEXT:    adcs x9, x13, x9
+; CHECK-NEXT:    adc x11, x12, x14
 ; CHECK-NEXT:    orr x12, x12, #0x100
-; CHECK-NEXT:    adc x9, x12, x9
-; CHECK-NEXT:    stp x11, x13, [x0, #16]
-; CHECK-NEXT:    stp x11, x9, [x1, #16]
+; CHECK-NEXT:    stp x9, x11, [x0, #16]
+; CHECK-NEXT:    adc x11, x12, x14
+; CHECK-NEXT:    stp x9, x11, [x1, #16]
 ; CHECK-NEXT:    stp x8, x10, [x1]
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll b/llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll
index c77861509e4a1..7f144df499be0 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll
@@ -756,75 +756,75 @@ define <32 x i64> @llrint_v32f64(<32 x double> %x) {
 ; CHECK-NEXT:    mov z18.d, z16.d[2]
 ; CHECK-NEXT:    mov z7.d, z16.d[1]
 ; CHECK-NEXT:    fcvtzs x13, d3
-; CHECK-NEXT:    fcvtzs x14, d20
 ; CHECK-NEXT:    str x9, [sp, #128]
+; CHECK-NEXT:    fcvtzs x9, d20
 ; CHECK-NEXT:    mov z16.d, z4.d[3]
-; CHECK-NEXT:    fcvtzs x9, d18
-; CHECK-NEXT:    mov z18.d, z4.d[2]
+; CHECK-NEXT:    ldp q3, q19, [x29, #80]
 ; CHECK-NEXT:    frintx z2.d, p0/m, z2.d
 ; CHECK-NEXT:    stp x11, x10, [sp, #144]
-; CHECK-NEXT:    fcvtzs x10, d7
+; CHECK-NEXT:    fcvtzs x10, d18
+; CHECK-NEXT:    fcvtzs x11, d7
+; CHECK-NEXT:    mov z18.d, z4.d[2]
 ; CHECK-NEXT:    mov z7.d, z4.d[1]
 ; CHECK-NEXT:    str x13, [sp, #136]
-; CHECK-NEXT:    fcvtzs x11, d16
+; CHECK-NEXT:    fcvtzs x13, d16
 ; CHECK-NEXT:    mov z16.d, z6.d[3]
-; CHECK-NEXT:    fcvtzs x13, d18
-; CHECK-NEXT:    ldp q3, q19, [x29, #80]
-; CHECK-NEXT:    stp x9, x14, [sp, #176]
-; CHECK-NEXT:    fcvtzs x9, d4
+; CHECK-NEXT:    splice z3.d, p1, z3.d, z19.d
+; CHECK-NEXT:    mov z1.d, z5.d[1]
+; CHECK-NEXT:    frintx z0.d, p0/m, z0.d
+; CHECK-NEXT:    stp x10, x9, [sp, #176]
+; CHECK-NEXT:    fcvtzs x9, d18
+; CHECK-NEXT:    fcvtzs x10, d4
+; CHECK-NEXT:    stp x12, x11, [sp, #160]
+; CHECK-NEXT:    fcvtzs x11, d7
 ; CHECK-NEXT:    mov z4.d, z6.d[2]
-; CHECK-NEXT:    stp x12, x10, [sp, #160]
-; CHECK-NEXT:    fcvtzs x10, d7
 ; CHECK-NEXT:    mov z7.d, z6.d[1]
 ; CHECK-NEXT:    fcvtzs x12, d6
-; CHECK-NEXT:    splice z3.d, p1, z3.d, z19.d
 ; CHECK-NEXT:    mov z6.d, z5.d[2]
-; CHECK-NEXT:    stp x13, x11, [sp, #208]
-; CHECK-NEXT:    fcvtzs x11, d16
+; CHECK-NEXT:    frintx z3.d, p0/m, z3.d
+; CHECK-NEXT:    stp x9, x13, [sp, #208]
+; CHECK-NEXT:    fcvtzs x9, d16
 ; CHECK-NEXT:    fcvtzs x13, d4
+; CHECK-NEXT:    stp x10, x11, [sp, #192]
+; CHECK-NEXT:    fcvtzs x10, d7
 ; CHECK-NEXT:    mov z4.d, z5.d[3]
-; CHECK-NEXT:    mov z1.d, z5.d[1]
-; CHECK-NEXT:    frintx z0.d, p0/m, z0.d
-; CHECK-NEXT:    stp x9, x10, [sp, #192]
-; CHECK-NEXT:    fcvtzs x9, d7
-; CHECK-NEXT:    frintx z3.d, p0/m, z3.d
-; CHECK-NEXT:    fcvtzs x10, d4
-; CHECK-NEXT:    stp x13, x11, [sp, #240]
-; CHECK-NEXT:    fcvtzs x11, d6
-; CHECK-NEXT:    mov z4.d, z2.d[3]
-; CHECK-NEXT:    fcvtzs x13, d2
-; CHECK-NEXT:    stp x12, x9, [sp, #224]
-; CHECK-NEXT:    fcvtzs x9, d5
+; CHECK-NEXT:    fcvtzs x11, d4
+; CHECK-NEXT:    stp x13, x9, [sp, #240]
+; CHECK-NEXT:    fcvtzs x9, d6
+; CHECK-NEXT:    stp x12, x10, [sp, #224]
+; CHECK-NEXT:    fcvtzs x10, d5
 ; CHECK-NEXT:    fcvtzs x12, d1
+; CHECK-NEXT:    mov z4.d, z2.d[3]
 ; CHECK-NEXT:    mov z5.d, z2.d[2]
 ; CHECK-NEXT:    mov z1.d, z2.d[1]
+; CHECK-NEXT:    fcvtzs x13, d2
 ; CHECK-NEXT:    mov z2.d, z3.d[2]
-; CHECK-NEXT:    stp x11, x10, [sp, #16]
-; CHECK-NEXT:    fcvtzs x10, d4
-; CHECK-NEXT:    mov z4.d, z3.d[3]
+; CHECK-NEXT:    stp x9, x11, [sp, #16]
+; CHECK-NEXT:    fcvtzs x9, d4
 ; CHECK-NEXT:    fcvtzs x11, d5
-; CHECK-NEXT:    stp x9, x12, [sp]
-; CHECK-NEXT:    fcvtzs x9, d1
+; CHECK-NEXT:    stp x10, x12, [sp]
+; CHECK-NEXT:    fcvtzs x10, d1
+; CHECK-NEXT:    mov z4.d, z3.d[3]
 ; CHECK-NEXT:    mov z1.d, z3.d[1]
 ; CHECK-NEXT:    fcvtzs x12, d4
-; CHECK-NEXT:    stp x11, x10, [sp, #48]
-; CHECK-NEXT:    fcvtzs x10, d2
+; CHECK-NEXT:    stp x11, x9, [sp, #48]
+; CHECK-NEXT:    fcvtzs x9, d2
 ; CHECK-NEXT:    fcvtzs x11, d3
-; CHECK-NEXT:    stp x13, x9, [sp, #32]
-; CHECK-NEXT:    fcvtzs x9, d1
+; CHECK-NEXT:    stp x13, x10, [sp, #32]
+; CHECK-NEXT:    fcvtzs x10, d1
 ; CHECK-NEXT:    mov z2.d, z0.d[3]
 ; CHECK-NEXT:    mov z3.d, z0.d[2]
 ; CHECK-NEXT:    mov z1.d, z0.d[1]
-; CHECK-NEXT:    fcvtzs x13, d2
-; CHECK-NEXT:    stp x10, x12, [sp, #80]
+; CHECK-NEXT:    stp x9, x12, [sp, #80]
 ; CHECK-NEXT:    fcvtzs x12, d0
-; CHECK-NEXT:    fcvtzs x10, d3
-; CHECK-NEXT:    stp x11, x9, [sp, #64]
-; CHECK-NEXT:    fcvtzs x9, d1
-; CHECK-NEXT:    stp x10, x13, [sp, #112]
-; CHECK-NEXT:    add x10, sp, #192
-; CHECK-NEXT:    stp x12, x9, [sp, #96]
+; CHECK-NEXT:    fcvtzs x13, d2
+; CHECK-NEXT:    fcvtzs x9, d3
+; CHECK-NEXT:    stp x11, x10, [sp, #64]
+; CHECK-NEXT:    fcvtzs x10, d1
+; CHECK-NEXT:    stp x9, x13, [sp, #112]
 ; CHECK-NEXT:    add x9, sp, #128
+; CHECK-NEXT:    stp x12, x10, [sp, #96]
+; CHECK-NEXT:    add x10, sp, #192
 ; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x9]
 ; CHECK-NEXT:    add x9, sp, #160
 ; CHECK-NEXT:    ld1d { z2.d }, p0/z, [x10]
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll b/llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll
index 6a97e7ad64bf3..9fe8d92a182ac 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll
@@ -1506,75 +1506,75 @@ define <32 x iXLen> @lrint_v32f64(<32 x double> %x) {
 ; CHECK-i64-NEXT:    mov z18.d, z16.d[2]
 ; CHECK-i64-NEXT:    mov z7.d, z16.d[1]
 ; CHECK-i64-NEXT:    fcvtzs x13, d3
-; CHECK-i64-NEXT:    fcvtzs x14, d20
 ; CHECK-i64-NEXT:    str x9, [sp, #128]
+; CHECK-i64-NEXT:    fcvtzs x9, d20
 ; CHECK-i64-NEXT:    mov z16.d, z4.d[3]
-; CHECK-i64-NEXT:    fcvtzs x9, d18
-; CHECK-i64-NEXT:    mov z18.d, z4.d[2]
+; CHECK-i64-NEXT:    ldp q3, q19, [x29, #80]
 ; CHECK-i64-NEXT:    frintx z2.d, p0/m, z2.d
 ; CHECK-i64-NEXT:    stp x11, x10, [sp, #144]
-; CHECK-i64-NEXT:    fcvtzs x10, d7
+; CHECK-i64-NEXT:    fcvtzs x10, d18
+; CHECK-i64-NEXT:    fcvtzs x11, d7
+; CHECK-i64-NEXT:    mov z18.d, z4.d[2]
 ; CHECK-i64-NEXT:    mov z7.d, z4.d[1]
 ; CHECK-i64-NEXT:    str x13, [sp, #136]
-; CHECK-i64-NEXT:    fcvtzs x11, d16
+; CHECK-i64-NEXT:    fcvtzs x13, d16
 ; CHECK-i64-NEXT:    mov z16.d, z6.d[3]
-; CHECK-i64-NEXT:    fcvtzs x13, d18
-; CHECK-i64-NEXT:    ldp q3, q19, [x29, #80]
-; CHECK-i64-NEXT:    stp x9, x14, [sp, #176]
-; CHECK-i64-NEXT:    fcvtzs x9, d4
+; CHECK-i64-NEXT:    splice z3.d, p1, z3.d, z19.d
+; CHECK-i64-NEXT:    mov z1.d, z5.d[1]
+; CHECK-i64-NEXT:    frintx z0.d, p0/m, z0.d
+; CHECK-i64-NEXT:    stp x10, x9, [sp, #176]
+; CHECK-i64-NEXT:    fcvtzs x9, d18
+; CHECK-i64-NEXT:    fcvtzs x10, d4
+; CHECK-i64-NEXT:    stp x12, x11, [sp, #160]
+; CHECK-i64-NEXT:    fcvtzs x11, d7
 ; CHECK-i64-NEXT:    mov z4.d, z6.d[2]
-; CHECK-i64-NEXT:    stp x12, x10, [sp, #160]
-; CHECK-i64-NEXT:    fcvtzs x10, d7
 ; CHECK-i64-NEXT:    mov z7.d, z6.d[1]
 ; CHECK-i64-NEXT:    fcvtzs x12, d6
-; CHECK-i64-NEXT:    splice z3.d, p1, z3.d, z19.d
 ; CHECK-i64-NEXT:    mov z6.d, z5.d[2]
-; CHECK-i64-NEXT:    stp x13, x11, [sp, #208]
-; CHECK-i64-NEXT:    fcvtzs x11, d16
+; CHECK-i64-NEXT:    frintx z3.d, p0/m, z3.d
+; CHECK-i64-NEXT:    stp x9, x13, [sp, #208]
+; CHECK-i64-NEXT:    fcvtzs x9, d16
 ; CHECK-i64-NEXT:    fcvtzs x13, d4
+; CHECK-i64-NEXT:    stp x10, x11, [sp, #192]
+; CHECK-i64-NEXT:    fcvtzs x10, d7
 ; CHECK-i64-NEXT:    mov z4.d, z5.d[3]
-; CHECK-i64-NEXT:    mov z1.d, z5.d[1]
-; CHECK-i64-NEXT:    frintx z0.d, p0/m, z0.d
-; CHECK-i64-NEXT:    stp x9, x10, [sp, #192]
-; CHECK-i64-NEXT:    fcvtzs x9, d7
-; CHECK-i64-NEXT:    frintx z3.d, p0/m, z3.d
-; CHECK-i64-NEXT:    fcvtzs x10, d4
-; CHECK-i64-NEXT:    stp x13, x11, [sp, #240]
-; CHECK-i64-NEXT:    fcvtzs x11, d6
-; CHECK-i64-NEXT:    mov z4.d, z2.d[3]
-; CHECK-i64-NEXT:    fcvtzs x13, d2
-; CHECK-i64-NEXT:    stp x12, x9, [sp, #224]
-; CHECK-i64-NEXT:    fcvtzs x9, d5
+; CHECK-i64-NEXT:    fcvtzs x11, d4
+; CHECK-i64-NEXT:    stp x13, x9, [sp, #240]
+; CHECK-i64-NEXT:    fcvtzs x9, d6
+; CHECK-i64-NEXT:    stp x12, x10, [sp, #224]
+; CHECK-i64-NEXT:    fcvtzs x10, d5
 ; CHECK-i64-NEXT:    fcvtzs x12, d1
+; CHECK-i64-NEXT:    mov z4.d, z2.d[3]
 ; CHECK-i64-NEXT:    mov z5.d, z2.d[2]
 ; CHECK-i64-NEXT:    mov z1.d, z2.d[1]
+; CHECK-i64-NEXT:    fcvtzs x13, d2
 ; CHECK-i64-NEXT:    mov z2.d, z3.d[2]
-; CHECK-i64-NEXT:    stp x11, x10, [sp, #16]
-; CHECK-i64-NEXT:    fcvtzs x10, d4
-; CHECK-i64-NEXT:    mov z4.d, z3.d[3]
+; CHECK-i64-NEXT:    stp x9, x11, [sp, #16]
+; CHECK-i64-NEXT:    fcvtzs x9, d4
 ; CHECK-i64-NEXT:    fcvtzs x11, d5
-; CHECK-i64-NEXT:    stp x9, x12, [sp]
-; CHECK-i64-NEXT:    fcvtzs x9, d1
+; CHECK-i64-NEXT:    stp x10, x12, [sp]
+; CHECK-i64-NEXT:    fcvtzs x10, d1
+; CHECK-i64-NEXT:    mov z4.d, z3.d[3]
 ; CHECK-i64-NEXT:    mov z1.d, z3.d[1]
 ; CHECK-i64-NEXT:    fcvtzs x12, d4
-; CHECK-i64-NEXT:    stp x11, x10, [sp, #48]
-; CHECK-i64-NEXT:    fcvtzs x10, d2
+; CHECK-i64-NEXT:    stp x11, x9, [sp, #48]
+; CHECK-i64-NEXT:    fcvtzs x9, d2
 ; CHECK-i64-NEXT:    fcvtzs x11, d3
-; CHECK-i64-NEXT:    stp x13, x9, [sp, #32]
-; CHECK-i64-NEXT:    fcvtzs x9, d1
+; CHECK-i64-NEXT:    stp x13, x10, [sp, #32]
+; CHECK-i64-NEXT:    fcvtzs x10, d1
 ; CHECK-i64-NEXT:    mov z2.d, z0.d[3]
 ; CHECK-i64-NEXT:    mov z3.d, z0.d[2]
 ; CHECK-i64-NEXT:    mov z1.d, z0.d[1]
-; CHECK-i64-NEXT:    fcvtzs x13, d2
-; CHECK-i64-NEXT:    stp x10, x12, [sp, #80]
+; CHECK-i64-NEXT:    stp x9, x12, [sp, #80]
 ; CHECK-i64-NEXT:    fcvtzs x12, d0
-; CHECK-i64-NEXT:    fcvtzs x10, d3
-; CHECK-i64-NEXT:    stp x11, x9, [sp, #64]
-; CHECK-i64-NEXT:    fcvtzs x9, d1
-; CHECK-i64-NEXT:    stp x10, x13, [sp, #112]
-; CHECK-i64-NEXT:    add x10, sp, #192
-; CHECK-i64-NEXT:    stp x12, x9, [sp, #96]
+; CHECK-i64-NEXT:    fcvtzs x13, d2
+; CHECK-i64-NEXT:    fcvtzs x9, d3
+; CHECK-i64-NEXT:    stp x11, x10, [sp, #64]
+; CHECK-i64-NEXT:    fcvtzs x10, d1
+; CHECK-i64-NEXT:    stp x9, x13, [sp, #112]
 ; CHECK-i64-NEXT:    add x9, sp, #128
+; CHECK-i64-NEXT:    stp x12, x10, [sp, #96]
+; CHECK-i64-NEXT:    add x10, sp, #192
 ; CHECK-i64-NEXT:    ld1d { z0.d }, p0/z, [x9]
 ; CHECK-i64-NEXT:    add x9, sp, #160
 ; CHECK-i64-NEXT:    ld1d { z2.d }, p0/z, [x10]
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll
index d3c446c9904b2..d29e43509dfe9 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll
@@ -40,64 +40,64 @@ define <8 x i32> @fixed_bitselect_v8i32(ptr %pre_cond_ptr, ptr %left_ptr, ptr %r
 ; NONEON-NOSVE-NEXT:    stp q0, q2, [sp, #-128]!
 ; NONEON-NOSVE-NEXT:    .cfi_def_cfa_offset 128
 ; NONEON-NOSVE-NEXT:    stp q1, q3, [sp, #48]
-; NONEON-NOSVE-NEXT:    ldp w8, w14, [sp, #48]
-; NONEON-NOSVE-NEXT:    ldp w9, w4, [sp, #64]
-; NONEON-NOSVE-NEXT:    ldp w13, w11, [sp, #56]
-; NONEON-NOSVE-NEXT:    neg w3, w8
-; NONEON-NOSVE-NEXT:    neg w15, w14
+; NONEON-NOSVE-NEXT:    ldp w13, w11, [sp, #48]
+; NONEON-NOSVE-NEXT:    ldp w14, w4, [sp, #64]
+; NONEON-NOSVE-NEXT:    ldp w17, w16, [sp]
+; NONEON-NOSVE-NEXT:    ldp w9, w8, [sp, #56]
+; NONEON-NOSVE-NEXT:    neg w15, w11
+; NONEON-NOSVE-NEXT:    neg w3, w13
 ; NONEON-NOSVE-NEXT:    str q4, [sp, #32]
-; NONEON-NOSVE-NEXT:    and w9, w3, w9
+; NONEON-NOSVE-NEXT:    and w14, w3, w14
 ; NONEON-NOSVE-NEXT:    and w15, w15, w4
-; NONEON-NOSVE-NEXT:    str q5, [sp, #80]
+; NONEON-NOSVE-NEXT:    neg w1, w17
 ; NONEON-NOSVE-NEXT:    ldp w5, w3, [sp, #72]
-; NONEON-NOSVE-NEXT:    ldp w16, w12, [sp]
-; NONEON-NOSVE-NEXT:    neg w4, w11
-; NONEON-NOSVE-NEXT:    neg w2, w13
-; NONEON-NOSVE-NEXT:    sub w11, w11, #1
-; NONEON-NOSVE-NEXT:    and w3, w4, w3
-; NONEON-NOSVE-NEXT:    and w2, w2, w5
-; NONEON-NOSVE-NEXT:    sub w13, w13, #1
 ; NONEON-NOSVE-NEXT:    ldp w6, w4, [sp, #16]
-; NONEON-NOSVE-NEXT:    ldp w10, w17, [sp, #8]
-; NONEON-NOSVE-NEXT:    neg w1, w16
-; NONEON-NOSVE-NEXT:    neg w0, w12
-; NONEON-NOSVE-NEXT:    sub w16, w16, #1
+; NONEON-NOSVE-NEXT:    ldp w12, w10, [sp, #8]
+; NONEON-NOSVE-NEXT:    neg w2, w9
+; NONEON-NOSVE-NEXT:    neg w7, w8
+; NONEON-NOSVE-NEXT:    sub w17, w17, #1
+; NONEON-NOSVE-NEXT:    and w2, w2, w5
 ; NONEON-NOSVE-NEXT:    and w1, w1, w6
-; NONEON-NOSVE-NEXT:    and w0, w0, w4
-; NONEON-NOSVE-NEXT:    sub w12, w12, #1
+; NONEON-NOSVE-NEXT:    and w3, w7, w3
 ; NONEON-NOSVE-NEXT:    ldp w5, w6, [sp, #24]
-; NONEON-NOSVE-NEXT:    neg w18, w17
-; NONEON-NOSVE-NEXT:    neg w4, w10
-; NONEON-NOSVE-NEXT:    sub w17, w17, #1
+; NONEON-NOSVE-NEXT:    neg w0, w12
+; NONEON-NOSVE-NEXT:    neg w7, w16
+; NONEON-NOSVE-NEXT:    neg w18, w10
+; NONEON-NOSVE-NEXT:    and w4, w7, w4
 ; NONEON-NOSVE-NEXT:    sub w10, w10, #1
-; NONEON-NOSVE-NEXT:    sub w14, w14, #1
-; NONEON-NOSVE-NEXT:    sub w8, w8, #1
-; NONEON-NOSVE-NEXT:    and w4, w4, w5
+; NONEON-NOSVE-NEXT:    sub w12, w12, #1
+; NONEON-NOSVE-NEXT:    and w0, w0, w5
 ; NONEON-NOSVE-NEXT:    and w18, w18, w6
-; NONEON-NOSVE-NEXT:    ldp w5, w6, [sp, #32]
+; NONEON-NOSVE-NEXT:    str q5, [sp, #80]
+; NONEON-NOSVE-NEXT:    ldp w7, w5, [sp, #32]
+; NONEON-NOSVE-NEXT:    sub w16, w16, #1
+; NONEON-NOSVE-NEXT:    sub w8, w8, #1
+; NONEON-NOSVE-NEXT:    sub w9, w9, #1
+; NONEON-NOSVE-NEXT:    and w17, w17, w7
 ; NONEON-NOSVE-NEXT:    and w16, w16, w5
+; NONEON-NOSVE-NEXT:    ldp w6, w7, [sp, #40]
 ; NONEON-NOSVE-NEXT:    and w12, w12, w6
-; NONEON-NOSVE-NEXT:    ldp w5, w6, [sp, #40]
-; NONEON-NOSVE-NEXT:    and w10, w10, w5
-; NONEON-NOSVE-NEXT:    and w17, w17, w6
-; NONEON-NOSVE-NEXT:    orr w17, w17, w18
-; NONEON-NOSVE-NEXT:    orr w10, w10, w4
-; NONEON-NOSVE-NEXT:    ldp w18, w4, [sp, #88]
+; NONEON-NOSVE-NEXT:    and w10, w10, w7
+; NONEON-NOSVE-NEXT:    orr w10, w10, w18
+; NONEON-NOSVE-NEXT:    orr w12, w12, w0
+; NONEON-NOSVE-NEXT:    ldp w18, w0, [sp, #88]
 ; NONEON-NOSVE-NEXT:    ldp w5, w6, [sp, #80]
-; NONEON-NOSVE-NEXT:    stp w10, w17, [sp, #104]
-; NONEON-NOSVE-NEXT:    orr w10, w12, w0
-; NONEON-NOSVE-NEXT:    orr w12, w16, w1
-; NONEON-NOSVE-NEXT:    and w11, w11, w4
-; NONEON-NOSVE-NEXT:    stp w12, w10, [sp, #96]
-; NONEON-NOSVE-NEXT:    and w10, w13, w18
-; NONEON-NOSVE-NEXT:    orr w11, w11, w3
-; NONEON-NOSVE-NEXT:    and w12, w14, w6
-; NONEON-NOSVE-NEXT:    orr w10, w10, w2
-; NONEON-NOSVE-NEXT:    and w8, w8, w5
-; NONEON-NOSVE-NEXT:    stp w10, w11, [sp, #120]
-; NONEON-NOSVE-NEXT:    orr w10, w12, w15
-; NONEON-NOSVE-NEXT:    orr w8, w8, w9
-; NONEON-NOSVE-NEXT:    stp w8, w10, [sp, #112]
+; NONEON-NOSVE-NEXT:    stp w12, w10, [sp, #104]
+; NONEON-NOSVE-NEXT:    sub w10, w11, #1
+; NONEON-NOSVE-NEXT:    sub w11, w13, #1
+; NONEON-NOSVE-NEXT:    and w8, w8, w0
+; NONEON-NOSVE-NEXT:    and w9, w9, w18
+; NONEON-NOSVE-NEXT:    orr w12, w16, w4
+; NONEON-NOSVE-NEXT:    orr w8, w8, w3
+; NONEON-NOSVE-NEXT:    orr w9, w9, w2
+; NONEON-NOSVE-NEXT:    and w10, w10, w6
+; NONEON-NOSVE-NEXT:    stp w9, w8, [sp, #120]
+; NONEON-NOSVE-NEXT:    and w8, w11, w5
+; NONEON-NOSVE-NEXT:    orr w13, w17, w1
+; NONEON-NOSVE-NEXT:    orr w9, w10, w15
+; NONEON-NOSVE-NEXT:    orr w8, w8, w14
+; NONEON-NOSVE-NEXT:    stp w13, w12, [sp, #96]
+; NONEON-NOSVE-NEXT:    stp w8, w9, [sp, #112]
 ; NONEON-NOSVE-NEXT:    ldp q0, q1, [sp, #96]
 ; NONEON-NOSVE-NEXT:    add sp, sp, #128
 ; NONEON-NOSVE-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll
index 5f6b60a767f9d..4fe303b9bbf46 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll
@@ -26,18 +26,18 @@ define void @fp_convert_combine_crash(ptr %a, ptr %b) {
 ; NONEON-NOSVE-NEXT:    .cfi_def_cfa_offset 64
 ; NONEON-NOSVE-NEXT:    ldp s1, s0, [sp, #24]
 ; NONEON-NOSVE-NEXT:    fcvtzs w8, s0, #3
-; NONEON-NOSVE-NEXT:    ldp s0, s2, [sp, #16]
+; NONEON-NOSVE-NEXT:    ldp s2, s0, [sp, #16]
 ; NONEON-NOSVE-NEXT:    fcvtzs w9, s1, #3
-; NONEON-NOSVE-NEXT:    fcvtzs w10, s2, #3
-; NONEON-NOSVE-NEXT:    fcvtzs w11, s0, #3
-; NONEON-NOSVE-NEXT:    ldp s2, s1, [sp, #8]
-; NONEON-NOSVE-NEXT:    ldp s0, s3, [sp]
+; NONEON-NOSVE-NEXT:    fcvtzs w10, s0, #3
+; NONEON-NOSVE-NEXT:    fcvtzs w11, s2, #3
+; NONEON-NOSVE-NEXT:    ldp s1, s0, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp s2, s3, [sp]
 ; NONEON-NOSVE-NEXT:    stp w9, w8, [sp, #56]
-; NONEON-NOSVE-NEXT:    fcvtzs w12, s1, #3
-; NONEON-NOSVE-NEXT:    fcvtzs w8, s2, #3
+; NONEON-NOSVE-NEXT:    fcvtzs w12, s0, #3
+; NONEON-NOSVE-NEXT:    fcvtzs w8, s1, #3
 ; NONEON-NOSVE-NEXT:    stp w11, w10, [sp, #48]
 ; NONEON-NOSVE-NEXT:    fcvtzs w9, s3, #3
-; NONEON-NOSVE-NEXT:    fcvtzs w10, s0, #3
+; NONEON-NOSVE-NEXT:    fcvtzs w10, s2, #3
 ; NONEON-NOSVE-NEXT:    stp w8, w12, [sp, #40]
 ; NONEON-NOSVE-NEXT:    stp w10, w9, [sp, #32]
 ; NONEON-NOSVE-NEXT:    ldp q0, q1, [sp, #32]
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll
index 4eaaee7ce5055..6618adbc9de32 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll
@@ -1030,10 +1030,10 @@ define float @fmaxv_v8f32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    ldp s3, s2, [sp]
 ; NONEON-NOSVE-NEXT:    fmaxnm s0, s2, s0
 ; NONEON-NOSVE-NEXT:    fmaxnm s1, s3, s1
-; NONEON-NOSVE-NEXT:    ldp s2, s4, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp s3, s4, [sp, #8]
 ; NONEON-NOSVE-NEXT:    fmaxnm s0, s1, s0
-; NONEON-NOSVE-NEXT:    ldp s3, s1, [sp, #24]
-; NONEON-NOSVE-NEXT:    fmaxnm s2, s2, s3
+; NONEON-NOSVE-NEXT:    ldp s2, s1, [sp, #24]
+; NONEON-NOSVE-NEXT:    fmaxnm s2, s3, s2
 ; NONEON-NOSVE-NEXT:    fmaxnm s1, s4, s1
 ; NONEON-NOSVE-NEXT:    fmaxnm s0, s0, s2
 ; NONEON-NOSVE-NEXT:    fmaxnm s0, s0, s1
@@ -1360,10 +1360,10 @@ define float @fminv_v8f32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    ldp s3, s2, [sp]
 ; NONEON-NOSVE-NEXT:    fminnm s0, s2, s0
 ; NONEON-NOSVE-NEXT:    fminnm s1, s3, s1
-; NONEON-NOSVE-NEXT:    ldp s2, s4, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp s3, s4, [sp, #8]
 ; NONEON-NOSVE-NEXT:    fminnm s0, s1, s0
-; NONEON-NOSVE-NEXT:    ldp s3, s1, [sp, #24]
-; NONEON-NOSVE-NEXT:    fminnm s2, s2, s3
+; NONEON-NOSVE-NEXT:    ldp s2, s1, [sp, #24]
+; NONEON-NOSVE-NEXT:    fminnm s2, s3, s2
 ; NONEON-NOSVE-NEXT:    fminnm s1, s4, s1
 ; NONEON-NOSVE-NEXT:    fminnm s0, s0, s2
 ; NONEON-NOSVE-NEXT:    fminnm s0, s0, s1
@@ -1690,10 +1690,10 @@ define float @fmaximumv_v8f32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    ldp s3, s2, [sp]
 ; NONEON-NOSVE-NEXT:    fmax s0, s2, s0
 ; NONEON-NOSVE-NEXT:    fmax s1, s3, s1
-; NONEON-NOSVE-NEXT:    ldp s2, s4, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp s3, s4, [sp, #8]
 ; NONEON-NOSVE-NEXT:    fmax s0, s1, s0
-; NONEON-NOSVE-NEXT:    ldp s3, s1, [sp, #24]
-; NONEON-NOSVE-NEXT:    fmax s2, s2, s3
+; NONEON-NOSVE-NEXT:    ldp s2, s1, [sp, #24]
+; NONEON-NOSVE-NEXT:    fmax s2, s3, s2
 ; NONEON-NOSVE-NEXT:    fmax s1, s4, s1
 ; NONEON-NOSVE-NEXT:    fmax s0, s0, s2
 ; NONEON-NOSVE-NEXT:    fmax s0, s0, s1
@@ -2020,10 +2020,10 @@ define float @fminimumv_v8f32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    ldp s3, s2, [sp]
 ; NONEON-NOSVE-NEXT:    fmin s0, s2, s0
 ; NONEON-NOSVE-NEXT:    fmin s1, s3, s1
-; NONEON-NOSVE-NEXT:    ldp s2, s4, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp s3, s4, [sp, #8]
 ; NONEON-NOSVE-NEXT:    fmin s0, s1, s0
-; NONEON-NOSVE-NEXT:    ldp s3, s1, [sp, #24]
-; NONEON-NOSVE-NEXT:    fmin s2, s2, s3
+; NONEON-NOSVE-NEXT:    ldp s2, s1, [sp, #24]
+; NONEON-NOSVE-NEXT:    fmin s2, s3, s2
 ; NONEON-NOSVE-NEXT:    fmin s1, s4, s1
 ; NONEON-NOSVE-NEXT:    fmin s0, s0, s2
 ; NONEON-NOSVE-NEXT:    fmin s0, s0, s1
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll
index ad5f91a5f39a4..ec0693a541e44 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll
@@ -194,12 +194,12 @@ define <8 x half> @select_v8f16(<8 x half> %op1, <8 x half> %op2, <8 x i1> %mask
 define void @select_v16f16(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v16f16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.h, vl8
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    fcmeq p1.h, p0/z, z0.h, z1.h
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    fcmeq p1.h, p0/z, z1.h, z0.h
 ; CHECK-NEXT:    fcmeq p0.h, p0/z, z2.h, z3.h
-; CHECK-NEXT:    sel z0.h, p1, z0.h, z1.h
+; CHECK-NEXT:    mov z0.h, p1/m, z1.h
 ; CHECK-NEXT:    sel z1.h, p0, z2.h, z3.h
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
@@ -429,12 +429,12 @@ define <4 x float> @select_v4f32(<4 x float> %op1, <4 x float> %op2, <4 x i1> %m
 define void @select_v8f32(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v8f32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.s, vl4
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    fcmeq p1.s, p0/z, z0.s, z1.s
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    fcmeq p1.s, p0/z, z1.s, z0.s
 ; CHECK-NEXT:    fcmeq p0.s, p0/z, z2.s, z3.s
-; CHECK-NEXT:    sel z0.s, p1, z0.s, z1.s
+; CHECK-NEXT:    mov z0.s, p1/m, z1.s
 ; CHECK-NEXT:    sel z1.s, p0, z2.s, z3.s
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
@@ -553,12 +553,12 @@ define <2 x double> @select_v2f64(<2 x double> %op1, <2 x double> %op2, <2 x i1>
 define void @select_v4f64(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v4f64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.d, vl2
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    fcmeq p1.d, p0/z, z0.d, z1.d
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    fcmeq p1.d, p0/z, z1.d, z0.d
 ; CHECK-NEXT:    fcmeq p0.d, p0/z, z2.d, z3.d
-; CHECK-NEXT:    sel z0.d, p1, z0.d, z1.d
+; CHECK-NEXT:    mov z0.d, p1/m, z1.d
 ; CHECK-NEXT:    sel z1.d, p0, z2.d, z3.d
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll
index 25a6ea490c163..40c8ab27c0b02 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll
@@ -1146,65 +1146,64 @@ define void @sext_v16i8_v16i64(<16 x i8> %a, ptr %out) {
 define void @sext_v32i8_v32i64(ptr %in, ptr %out) {
 ; CHECK-LABEL: sext_v32i8_v32i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q1, q0, [x0]
+; CHECK-NEXT:    ldp q0, q1, [x0]
 ; CHECK-NEXT:    add z0.b, z0.b, z0.b
 ; CHECK-NEXT:    add z1.b, z1.b, z1.b
 ; CHECK-NEXT:    mov z2.d, z0.d
+; CHECK-NEXT:    sunpklo z3.h, z1.b
+; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
+; CHECK-NEXT:    ext z2.b, z2.b, z0.b, #8
 ; CHECK-NEXT:    sunpklo z0.h, z0.b
-; CHECK-NEXT:    mov z3.d, z1.d
 ; CHECK-NEXT:    sunpklo z1.h, z1.b
-; CHECK-NEXT:    ext z2.b, z2.b, z2.b, #8
+; CHECK-NEXT:    sunpklo z4.s, z3.h
 ; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
-; CHECK-NEXT:    sunpklo z4.s, z0.h
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
-; CHECK-NEXT:    sunpklo z5.s, z1.h
-; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
 ; CHECK-NEXT:    sunpklo z2.h, z2.b
-; CHECK-NEXT:    sunpklo z3.h, z3.b
-; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    sunpklo z5.s, z0.h
+; CHECK-NEXT:    mov z7.d, z1.d
+; CHECK-NEXT:    sunpklo z3.s, z3.h
 ; CHECK-NEXT:    sunpklo z16.d, z4.s
 ; CHECK-NEXT:    ext z4.b, z4.b, z4.b, #8
-; CHECK-NEXT:    sunpklo z1.s, z1.h
-; CHECK-NEXT:    sunpklo z17.d, z5.s
-; CHECK-NEXT:    ext z5.b, z5.b, z5.b, #8
+; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
 ; CHECK-NEXT:    sunpklo z6.s, z2.h
-; CHECK-NEXT:    sunpklo z7.s, z3.h
 ; CHECK-NEXT:    ext z2.b, z2.b, z2.b, #8
-; CHECK-NEXT:    sunpklo z4.d, z4.s
-; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
-; CHECK-NEXT:    sunpklo z19.d, z0.s
+; CHECK-NEXT:    ext z7.b, z7.b, z1.b, #8
+; CHECK-NEXT:    mov z17.d, z5.d
+; CHECK-NEXT:    sunpklo z1.s, z1.h
 ; CHECK-NEXT:    sunpklo z5.d, z5.s
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    sunpklo z4.d, z4.s
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    sunpklo z19.d, z3.s
 ; CHECK-NEXT:    sunpklo z2.s, z2.h
+; CHECK-NEXT:    sunpklo z7.s, z7.h
+; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
+; CHECK-NEXT:    ext z17.b, z17.b, z17.b, #8
 ; CHECK-NEXT:    sunpklo z18.d, z6.s
 ; CHECK-NEXT:    ext z6.b, z6.b, z6.b, #8
-; CHECK-NEXT:    sunpklo z3.s, z3.h
+; CHECK-NEXT:    sunpklo z20.d, z1.s
+; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
 ; CHECK-NEXT:    stp q16, q4, [x1, #128]
-; CHECK-NEXT:    mov z16.d, z7.d
-; CHECK-NEXT:    sunpklo z0.d, z0.s
-; CHECK-NEXT:    stp q17, q5, [x1]
-; CHECK-NEXT:    sunpklo z5.d, z7.s
-; CHECK-NEXT:    sunpklo z4.d, z6.s
-; CHECK-NEXT:    mov z6.d, z1.d
-; CHECK-NEXT:    ext z16.b, z16.b, z7.b, #8
-; CHECK-NEXT:    mov z7.d, z2.d
-; CHECK-NEXT:    stp q19, q0, [x1, #160]
-; CHECK-NEXT:    sunpklo z0.d, z2.s
-; CHECK-NEXT:    ext z6.b, z6.b, z1.b, #8
-; CHECK-NEXT:    sunpklo z1.d, z1.s
-; CHECK-NEXT:    stp q18, q4, [x1, #192]
-; CHECK-NEXT:    mov z4.d, z3.d
-; CHECK-NEXT:    ext z7.b, z7.b, z2.b, #8
-; CHECK-NEXT:    sunpklo z16.d, z16.s
-; CHECK-NEXT:    sunpklo z6.d, z6.s
-; CHECK-NEXT:    ext z4.b, z4.b, z3.b, #8
-; CHECK-NEXT:    sunpklo z2.d, z7.s
 ; CHECK-NEXT:    sunpklo z3.d, z3.s
-; CHECK-NEXT:    stp q5, q16, [x1, #64]
-; CHECK-NEXT:    stp q1, q6, [x1, #32]
+; CHECK-NEXT:    sunpklo z16.d, z0.s
+; CHECK-NEXT:    sunpklo z17.d, z17.s
+; CHECK-NEXT:    mov z4.d, z7.d
+; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    sunpklo z1.d, z1.s
+; CHECK-NEXT:    ext z4.b, z4.b, z7.b, #8
+; CHECK-NEXT:    stp q19, q3, [x1, #160]
+; CHECK-NEXT:    sunpklo z0.d, z0.s
+; CHECK-NEXT:    stp q5, q17, [x1]
+; CHECK-NEXT:    sunpklo z5.d, z6.s
+; CHECK-NEXT:    mov z6.d, z2.d
+; CHECK-NEXT:    stp q20, q1, [x1, #192]
+; CHECK-NEXT:    sunpklo z7.d, z7.s
 ; CHECK-NEXT:    sunpklo z1.d, z4.s
-; CHECK-NEXT:    stp q0, q2, [x1, #224]
-; CHECK-NEXT:    stp q3, q1, [x1, #96]
+; CHECK-NEXT:    ext z6.b, z6.b, z2.b, #8
+; CHECK-NEXT:    sunpklo z2.d, z2.s
+; CHECK-NEXT:    stp q16, q0, [x1, #32]
+; CHECK-NEXT:    stp q18, q5, [x1, #64]
+; CHECK-NEXT:    sunpklo z3.d, z6.s
+; CHECK-NEXT:    stp q7, q1, [x1, #224]
+; CHECK-NEXT:    stp q2, q3, [x1, #96]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: sext_v32i8_v32i64:
@@ -3131,65 +3130,64 @@ define void @zext_v16i8_v16i64(<16 x i8> %a, ptr %out) {
 define void @zext_v32i8_v32i64(ptr %in, ptr %out) {
 ; CHECK-LABEL: zext_v32i8_v32i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q1, q0, [x0]
+; CHECK-NEXT:    ldp q0, q1, [x0]
 ; CHECK-NEXT:    add z0.b, z0.b, z0.b
 ; CHECK-NEXT:    add z1.b, z1.b, z1.b
 ; CHECK-NEXT:    mov z2.d, z0.d
+; CHECK-NEXT:    uunpklo z3.h, z1.b
+; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
+; CHECK-NEXT:    ext z2.b, z2.b, z0.b, #8
 ; CHECK-NEXT:    uunpklo z0.h, z0.b
-; CHECK-NEXT:    mov z3.d, z1.d
 ; CHECK-NEXT:    uunpklo z1.h, z1.b
-; CHECK-NEXT:    ext z2.b, z2.b, z2.b, #8
+; CHECK-NEXT:    uunpklo z4.s, z3.h
 ; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
-; CHECK-NEXT:    uunpklo z4.s, z0.h
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
-; CHECK-NEXT:    uunpklo z5.s, z1.h
-; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
 ; CHECK-NEXT:    uunpklo z2.h, z2.b
-; CHECK-NEXT:    uunpklo z3.h, z3.b
-; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    uunpklo z5.s, z0.h
+; CHECK-NEXT:    mov z7.d, z1.d
+; CHECK-NEXT:    uunpklo z3.s, z3.h
 ; CHECK-NEXT:    uunpklo z16.d, z4.s
 ; CHECK-NEXT:    ext z4.b, z4.b, z4.b, #8
-; CHECK-NEXT:    uunpklo z1.s, z1.h
-; CHECK-NEXT:    uunpklo z17.d, z5.s
-; CHECK-NEXT:    ext z5.b, z5.b, z5.b, #8
+; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
 ; CHECK-NEXT:    uunpklo z6.s, z2.h
-; CHECK-NEXT:    uunpklo z7.s, z3.h
 ; CHECK-NEXT:    ext z2.b, z2.b, z2.b, #8
-; CHECK-NEXT:    uunpklo z4.d, z4.s
-; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
-; CHECK-NEXT:    uunpklo z19.d, z0.s
+; CHECK-NEXT:    ext z7.b, z7.b, z1.b, #8
+; CHECK-NEXT:    mov z17.d, z5.d
+; CHECK-NEXT:    uunpklo z1.s, z1.h
 ; CHECK-NEXT:    uunpklo z5.d, z5.s
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    uunpklo z4.d, z4.s
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    uunpklo z19.d, z3.s
 ; CHECK-NEXT:    uunpklo z2.s, z2.h
+; CHECK-NEXT:    uunpklo z7.s, z7.h
+; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
+; CHECK-NEXT:    ext z17.b, z17.b, z17.b, #8
 ; CHECK-NEXT:    uunpklo z18.d, z6.s
 ; CHECK-NEXT:    ext z6.b, z6.b, z6.b, #8
-; CHECK-NEXT:    uunpklo z3.s, z3.h
+; CHECK-NEXT:    uunpklo z20.d, z1.s
+; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
 ; CHECK-NEXT:    stp q16, q4, [x1, #128]
-; CHECK-NEXT:    mov z16.d, z7.d
-; CHECK-NEXT:    uunpklo z0.d, z0.s
-; CHECK-NEXT:    stp q17, q5, [x1]
-; CHECK-NEXT:    uunpklo z5.d, z7.s
-; CHECK-NEXT:    uunpklo z4.d, z6.s
-; CHECK-NEXT:    mov z6.d, z1.d
-; CHECK-NEXT:    ext z16.b, z16.b, z7.b, #8
-; CHECK-NEXT:    mov z7.d, z2.d
-; CHECK-NEXT:    stp q19, q0, [x1, #160]
-; CHECK-NEXT:    uunpklo z0.d, z2.s
-; CHECK-NEXT:    ext z6.b, z6.b, z1.b, #8
-; CHECK-NEXT:    uunpklo z1.d, z1.s
-; CHECK-NEXT:    stp q18, q4, [x1, #192]
-; CHECK-NEXT:    mov z4.d, z3.d
-; CHECK-NEXT:    ext z7.b, z7.b, z2.b, #8
-; CHECK-NEXT:    uunpklo z16.d, z16.s
-; CHECK-NEXT:    uunpklo z6.d, z6.s
-; CHECK-NEXT:    ext z4.b, z4.b, z3.b, #8
-; CHECK-NEXT:    uunpklo z2.d, z7.s
 ; CHECK-NEXT:    uunpklo z3.d, z3.s
-; CHECK-NEXT:    stp q5, q16, [x1, #64]
-; CHECK-NEXT:    stp q1, q6, [x1, #32]
+; CHECK-NEXT:    uunpklo z16.d, z0.s
+; CHECK-NEXT:    uunpklo z17.d, z17.s
+; CHECK-NEXT:    mov z4.d, z7.d
+; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    uunpklo z1.d, z1.s
+; CHECK-NEXT:    ext z4.b, z4.b, z7.b, #8
+; CHECK-NEXT:    stp q19, q3, [x1, #160]
+; CHECK-NEXT:    uunpklo z0.d, z0.s
+; CHECK-NEXT:    stp q5, q17, [x1]
+; CHECK-NEXT:    uunpklo z5.d, z6.s
+; CHECK-NEXT:    mov z6.d, z2.d
+; CHECK-NEXT:    stp q20, q1, [x1, #192]
+; CHECK-NEXT:    uunpklo z7.d, z7.s
 ; CHECK-NEXT:    uunpklo z1.d, z4.s
-; CHECK-NEXT:    stp q0, q2, [x1, #224]
-; CHECK-NEXT:    stp q3, q1, [x1, #96]
+; CHECK-NEXT:    ext z6.b, z6.b, z2.b, #8
+; CHECK-NEXT:    uunpklo z2.d, z2.s
+; CHECK-NEXT:    stp q16, q0, [x1, #32]
+; CHECK-NEXT:    stp q18, q5, [x1, #64]
+; CHECK-NEXT:    uunpklo z3.d, z6.s
+; CHECK-NEXT:    stp q7, q1, [x1, #224]
+; CHECK-NEXT:    stp q2, q3, [x1, #96]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: zext_v32i8_v32i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll
index 97f2e7a1e66cb..0c97eedd4362d 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll
@@ -973,11 +973,11 @@ define <4 x i32> @smulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) {
 ; NONEON-NOSVE-NEXT:    stp d0, d1, [sp, #48]
 ; NONEON-NOSVE-NEXT:    ldpsw x13, x12, [sp, #48]
 ; NONEON-NOSVE-NEXT:    smull x11, w11, w12
-; NONEON-NOSVE-NEXT:    ldpsw x12, x14, [sp, #56]
+; NONEON-NOSVE-NEXT:    ldpsw x14, x12, [sp, #56]
 ; NONEON-NOSVE-NEXT:    smull x10, w10, w13
 ; NONEON-NOSVE-NEXT:    lsr x11, x11, #32
-; NONEON-NOSVE-NEXT:    smull x9, w9, w14
-; NONEON-NOSVE-NEXT:    smull x8, w8, w12
+; NONEON-NOSVE-NEXT:    smull x9, w9, w12
+; NONEON-NOSVE-NEXT:    smull x8, w8, w14
 ; NONEON-NOSVE-NEXT:    lsr x10, x10, #32
 ; NONEON-NOSVE-NEXT:    lsr x9, x9, #32
 ; NONEON-NOSVE-NEXT:    stp w10, w11, [sp, #72]
@@ -1038,12 +1038,12 @@ define void @smulh_v8i32(ptr %a, ptr %b) {
 ; NONEON-NOSVE-NEXT:    stp d0, d1, [sp, #112]
 ; NONEON-NOSVE-NEXT:    ldpsw x17, x16, [sp, #112]
 ; NONEON-NOSVE-NEXT:    smull x15, w15, w16
-; NONEON-NOSVE-NEXT:    ldpsw x16, x18, [sp, #120]
+; NONEON-NOSVE-NEXT:    ldpsw x18, x16, [sp, #120]
 ; NONEON-NOSVE-NEXT:    smull x14, w14, w17
 ; NONEON-NOSVE-NEXT:    ldpsw x17, x1, [sp, #80]
-; NONEON-NOSVE-NEXT:    smull x13, w13, w18
+; NONEON-NOSVE-NEXT:    smull x13, w13, w16
 ; NONEON-NOSVE-NEXT:    lsr x15, x15, #32
-; NONEON-NOSVE-NEXT:    smull x12, w12, w16
+; NONEON-NOSVE-NEXT:    smull x12, w12, w18
 ; NONEON-NOSVE-NEXT:    lsr x14, x14, #32
 ; NONEON-NOSVE-NEXT:    ldpsw x16, x18, [sp, #88]
 ; NONEON-NOSVE-NEXT:    smull x11, w11, w1
@@ -2172,11 +2172,11 @@ define <4 x i32> @umulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) {
 ; NONEON-NOSVE-NEXT:    stp d0, d1, [sp, #48]
 ; NONEON-NOSVE-NEXT:    ldp w13, w12, [sp, #48]
 ; NONEON-NOSVE-NEXT:    umull x11, w11, w12
-; NONEON-NOSVE-NEXT:    ldp w12, w14, [sp, #56]
+; NONEON-NOSVE-NEXT:    ldp w14, w12, [sp, #56]
 ; NONEON-NOSVE-NEXT:    umull x10, w10, w13
 ; NONEON-NOSVE-NEXT:    lsr x11, x11, #32
-; NONEON-NOSVE-NEXT:    umull x9, w9, w14
-; NONEON-NOSVE-NEXT:    umull x8, w8, w12
+; NONEON-NOSVE-NEXT:    umull x9, w9, w12
+; NONEON-NOSVE-NEXT:    umull x8, w8, w14
 ; NONEON-NOSVE-NEXT:    lsr x10, x10, #32
 ; NONEON-NOSVE-NEXT:    lsr x9, x9, #32
 ; NONEON-NOSVE-NEXT:    stp w10, w11, [sp, #72]
@@ -2237,12 +2237,12 @@ define void @umulh_v8i32(ptr %a, ptr %b) {
 ; NONEON-NOSVE-NEXT:    stp d0, d1, [sp, #112]
 ; NONEON-NOSVE-NEXT:    ldp w17, w16, [sp, #112]
 ; NONEON-NOSVE-NEXT:    umull x15, w15, w16
-; NONEON-NOSVE-NEXT:    ldp w16, w18, [sp, #120]
+; NONEON-NOSVE-NEXT:    ldp w18, w16, [sp, #120]
 ; NONEON-NOSVE-NEXT:    umull x14, w14, w17
 ; NONEON-NOSVE-NEXT:    ldp w17, w1, [sp, #80]
-; NONEON-NOSVE-NEXT:    umull x13, w13, w18
+; NONEON-NOSVE-NEXT:    umull x13, w13, w16
 ; NONEON-NOSVE-NEXT:    lsr x15, x15, #32
-; NONEON-NOSVE-NEXT:    umull x12, w12, w16
+; NONEON-NOSVE-NEXT:    umull x12, w12, w18
 ; NONEON-NOSVE-NEXT:    lsr x14, x14, #32
 ; NONEON-NOSVE-NEXT:    ldp w16, w18, [sp, #88]
 ; NONEON-NOSVE-NEXT:    umull x11, w11, w1
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll
index 244dcc734bd7c..2678324728d0e 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll
@@ -855,11 +855,11 @@ define i32 @smaxv_v8i32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    cmp w11, w10
 ; NONEON-NOSVE-NEXT:    csel w9, w11, w10, gt
 ; NONEON-NOSVE-NEXT:    cmp w9, w8
-; NONEON-NOSVE-NEXT:    ldp w10, w12, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp w11, w12, [sp, #8]
 ; NONEON-NOSVE-NEXT:    csel w8, w9, w8, gt
-; NONEON-NOSVE-NEXT:    ldp w11, w9, [sp, #24]
-; NONEON-NOSVE-NEXT:    cmp w10, w11
-; NONEON-NOSVE-NEXT:    csel w10, w10, w11, gt
+; NONEON-NOSVE-NEXT:    ldp w10, w9, [sp, #24]
+; NONEON-NOSVE-NEXT:    cmp w11, w10
+; NONEON-NOSVE-NEXT:    csel w10, w11, w10, gt
 ; NONEON-NOSVE-NEXT:    cmp w8, w10
 ; NONEON-NOSVE-NEXT:    csel w8, w8, w10, gt
 ; NONEON-NOSVE-NEXT:    cmp w12, w9
@@ -1363,11 +1363,11 @@ define i32 @sminv_v8i32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    cmp w11, w10
 ; NONEON-NOSVE-NEXT:    csel w9, w11, w10, lt
 ; NONEON-NOSVE-NEXT:    cmp w9, w8
-; NONEON-NOSVE-NEXT:    ldp w10, w12, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp w11, w12, [sp, #8]
 ; NONEON-NOSVE-NEXT:    csel w8, w9, w8, lt
-; NONEON-NOSVE-NEXT:    ldp w11, w9, [sp, #24]
-; NONEON-NOSVE-NEXT:    cmp w10, w11
-; NONEON-NOSVE-NEXT:    csel w10, w10, w11, lt
+; NONEON-NOSVE-NEXT:    ldp w10, w9, [sp, #24]
+; NONEON-NOSVE-NEXT:    cmp w11, w10
+; NONEON-NOSVE-NEXT:    csel w10, w11, w10, lt
 ; NONEON-NOSVE-NEXT:    cmp w8, w10
 ; NONEON-NOSVE-NEXT:    csel w8, w8, w10, lt
 ; NONEON-NOSVE-NEXT:    cmp w12, w9
@@ -1871,11 +1871,11 @@ define i32 @umaxv_v8i32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    cmp w11, w10
 ; NONEON-NOSVE-NEXT:    csel w9, w11, w10, hi
 ; NONEON-NOSVE-NEXT:    cmp w9, w8
-; NONEON-NOSVE-NEXT:    ldp w10, w12, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp w11, w12, [sp, #8]
 ; NONEON-NOSVE-NEXT:    csel w8, w9, w8, hi
-; NONEON-NOSVE-NEXT:    ldp w11, w9, [sp, #24]
-; NONEON-NOSVE-NEXT:    cmp w10, w11
-; NONEON-NOSVE-NEXT:    csel w10, w10, w11, hi
+; NONEON-NOSVE-NEXT:    ldp w10, w9, [sp, #24]
+; NONEON-NOSVE-NEXT:    cmp w11, w10
+; NONEON-NOSVE-NEXT:    csel w10, w11, w10, hi
 ; NONEON-NOSVE-NEXT:    cmp w8, w10
 ; NONEON-NOSVE-NEXT:    csel w8, w8, w10, hi
 ; NONEON-NOSVE-NEXT:    cmp w12, w9
@@ -2379,11 +2379,11 @@ define i32 @uminv_v8i32(ptr %a) {
 ; NONEON-NOSVE-NEXT:    cmp w11, w10
 ; NONEON-NOSVE-NEXT:    csel w9, w11, w10, lo
 ; NONEON-NOSVE-NEXT:    cmp w9, w8
-; NONEON-NOSVE-NEXT:    ldp w10, w12, [sp, #8]
+; NONEON-NOSVE-NEXT:    ldp w11, w12, [sp, #8]
 ; NONEON-NOSVE-NEXT:    csel w8, w9, w8, lo
-; NONEON-NOSVE-NEXT:    ldp w11, w9, [sp, #24]
-; NONEON-NOSVE-NEXT:    cmp w10, w11
-; NONEON-NOSVE-NEXT:    csel w10, w10, w11, lo
+; NONEON-NOSVE-NEXT:    ldp w10, w9, [sp, #24]
+; NONEON-NOSVE-NEXT:    cmp w11, w10
+; NONEON-NOSVE-NEXT:    csel w10, w11, w10, lo
 ; NONEON-NOSVE-NEXT:    cmp w8, w10
 ; NONEON-NOSVE-NEXT:    csel w8, w8, w10, lo
 ; NONEON-NOSVE-NEXT:    cmp w12, w9
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll
index d61f92b406294..46a2ce6ed7109 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll
@@ -562,41 +562,42 @@ define void @ucvtf_v16i16_v16f64(ptr %a, ptr %b) {
 ; CHECK-NEXT:    ldp q1, q0, [x0]
 ; CHECK-NEXT:    ptrue p0.d, vl2
 ; CHECK-NEXT:    mov z2.d, z0.d
-; CHECK-NEXT:    uunpklo z3.s, z1.h
-; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
-; CHECK-NEXT:    uunpklo z0.s, z0.h
-; CHECK-NEXT:    ext z2.b, z2.b, z2.b, #8
+; CHECK-NEXT:    mov z3.d, z1.d
 ; CHECK-NEXT:    uunpklo z1.s, z1.h
-; CHECK-NEXT:    mov z5.d, z3.d
-; CHECK-NEXT:    uunpklo z4.d, z0.s
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    ext z2.b, z2.b, z0.b, #8
+; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    mov z4.d, z1.d
 ; CHECK-NEXT:    uunpklo z2.s, z2.h
-; CHECK-NEXT:    ext z5.b, z5.b, z3.b, #8
-; CHECK-NEXT:    mov z7.d, z1.d
-; CHECK-NEXT:    uunpklo z3.d, z3.s
-; CHECK-NEXT:    uunpklo z0.d, z0.s
-; CHECK-NEXT:    ucvtf z4.d, p0/m, z4.d
-; CHECK-NEXT:    mov z6.d, z2.d
-; CHECK-NEXT:    uunpklo z5.d, z5.s
-; CHECK-NEXT:    ext z7.b, z7.b, z1.b, #8
+; CHECK-NEXT:    uunpklo z3.s, z3.h
+; CHECK-NEXT:    uunpklo z5.d, z0.s
+; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    ext z4.b, z4.b, z1.b, #8
 ; CHECK-NEXT:    uunpklo z1.d, z1.s
-; CHECK-NEXT:    ucvtf z3.d, p0/m, z3.d
-; CHECK-NEXT:    ucvtf z0.d, p0/m, z0.d
-; CHECK-NEXT:    ext z6.b, z6.b, z2.b, #8
-; CHECK-NEXT:    uunpklo z2.d, z2.s
-; CHECK-NEXT:    uunpklo z7.d, z7.s
+; CHECK-NEXT:    mov z6.d, z2.d
+; CHECK-NEXT:    mov z7.d, z3.d
+; CHECK-NEXT:    uunpklo z0.d, z0.s
+; CHECK-NEXT:    uunpklo z4.d, z4.s
 ; CHECK-NEXT:    ucvtf z5.d, p0/m, z5.d
 ; CHECK-NEXT:    ucvtf z1.d, p0/m, z1.d
+; CHECK-NEXT:    ext z6.b, z6.b, z2.b, #8
+; CHECK-NEXT:    ext z7.b, z7.b, z3.b, #8
+; CHECK-NEXT:    uunpklo z2.d, z2.s
+; CHECK-NEXT:    uunpklo z3.d, z3.s
+; CHECK-NEXT:    ucvtf z0.d, p0/m, z0.d
+; CHECK-NEXT:    ucvtf z4.d, p0/m, z4.d
 ; CHECK-NEXT:    uunpklo z6.d, z6.s
-; CHECK-NEXT:    stp q4, q0, [x1, #64]
+; CHECK-NEXT:    uunpklo z7.d, z7.s
 ; CHECK-NEXT:    ucvtf z2.d, p0/m, z2.d
-; CHECK-NEXT:    stp q3, q5, [x1]
-; CHECK-NEXT:    movprfx z3, z7
-; CHECK-NEXT:    ucvtf z3.d, p0/m, z7.d
-; CHECK-NEXT:    movprfx z0, z6
-; CHECK-NEXT:    ucvtf z0.d, p0/m, z6.d
-; CHECK-NEXT:    stp q1, q3, [x1, #32]
-; CHECK-NEXT:    stp q2, q0, [x1, #96]
+; CHECK-NEXT:    stp q5, q0, [x1, #64]
+; CHECK-NEXT:    ucvtf z3.d, p0/m, z3.d
+; CHECK-NEXT:    stp q1, q4, [x1]
+; CHECK-NEXT:    movprfx z1, z6
+; CHECK-NEXT:    ucvtf z1.d, p0/m, z6.d
+; CHECK-NEXT:    movprfx z0, z7
+; CHECK-NEXT:    ucvtf z0.d, p0/m, z7.d
+; CHECK-NEXT:    stp q3, q0, [x1, #32]
+; CHECK-NEXT:    stp q2, q1, [x1, #96]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: ucvtf_v16i16_v16f64:
@@ -2000,41 +2001,42 @@ define void @scvtf_v16i16_v16f64(ptr %a, ptr %b) {
 ; CHECK-NEXT:    ldp q1, q0, [x0]
 ; CHECK-NEXT:    ptrue p0.d, vl2
 ; CHECK-NEXT:    mov z2.d, z0.d
-; CHECK-NEXT:    sunpklo z3.s, z1.h
-; CHECK-NEXT:    ext z1.b, z1.b, z1.b, #8
-; CHECK-NEXT:    sunpklo z0.s, z0.h
-; CHECK-NEXT:    ext z2.b, z2.b, z2.b, #8
+; CHECK-NEXT:    mov z3.d, z1.d
 ; CHECK-NEXT:    sunpklo z1.s, z1.h
-; CHECK-NEXT:    mov z5.d, z3.d
-; CHECK-NEXT:    sunpklo z4.d, z0.s
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    ext z2.b, z2.b, z0.b, #8
+; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #8
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    mov z4.d, z1.d
 ; CHECK-NEXT:    sunpklo z2.s, z2.h
-; CHECK-NEXT:    ext z5.b, z5.b, z3.b, #8
-; CHECK-NEXT:    mov z7.d, z1.d
-; CHECK-NEXT:    sunpklo z3.d, z3.s
-; CHECK-NEXT:    sunpklo z0.d, z0.s
-; CHECK-NEXT:    scvtf z4.d, p0/m, z4.d
-; CHECK-NEXT:    mov z6.d, z2.d
-; CHECK-NEXT:    sunpklo z5.d, z5.s
-; CHECK-NEXT:    ext z7.b, z7.b, z1.b, #8
+; CHECK-NEXT:    sunpklo z3.s, z3.h
+; CHECK-NEXT:    sunpklo z5.d, z0.s
+; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
+; CHECK-NEXT:    ext z4.b, z4.b, z1.b, #8
 ; CHECK-NEXT:    sunpklo z1.d, z1.s
-; CHECK-NEXT:    scvtf z3.d, p0/m, z3.d
-; CHECK-NEXT:    scvtf z0.d, p0/m, z0.d
-; CHECK-NEXT:    ext z6.b, z6.b, z2.b, #8
-; CHECK-NEXT:    sunpklo z2.d, z2.s
-; CHECK-NEXT:    sunpklo z7.d, z7.s
+; CHECK-NEXT:    mov z6.d, z2.d
+; CHECK-NEXT:    mov z7.d, z3.d
+; CHECK-NEXT:    sunpklo z0.d, z0.s
+; CHECK-NEXT:    sunpklo z4.d, z4.s
 ; CHECK-NEXT:    scvtf z5.d, p0/m, z5.d
 ; CHECK-NEXT:    scvtf z1.d, p0/m, z1.d
+; CHECK-NEXT:    ext z6.b, z6.b, z2.b, #8
+; CHECK-NEXT:    ext z7.b, z7.b, z3.b, #8
+; CHECK-NEXT:    sunpklo z2.d, z2.s
+; CHECK-NEXT:    sunpklo z3.d, z3.s
+; CHECK-NEXT:    scvtf z0.d, p0/m, z0.d
+; CHECK-NEXT:    scvtf z4.d, p0/m, z4.d
 ; CHECK-NEXT:    sunpklo z6.d, z6.s
-; CHECK-NEXT:    stp q4, q0, [x1, #64]
+; CHECK-NEXT:    sunpklo z7.d, z7.s
 ; CHECK-NEXT:    scvtf z2.d, p0/m, z2.d
-; CHECK-NEXT:    stp q3, q5, [x1]
-; CHECK-NEXT:    movprfx z3, z7
-; CHECK-NEXT:    scvtf z3.d, p0/m, z7.d
-; CHECK-NEXT:    movprfx z0, z6
-; CHECK-NEXT:    scvtf z0.d, p0/m, z6.d
-; CHECK-NEXT:    stp q1, q3, [x1, #32]
-; CHECK-NEXT:    stp q2, q0, [x1, #96]
+; CHECK-NEXT:    stp q5, q0, [x1, #64]
+; CHECK-NEXT:    scvtf z3.d, p0/m, z3.d
+; CHECK-NEXT:    stp q1, q4, [x1]
+; CHECK-NEXT:    movprfx z1, z6
+; CHECK-NEXT:    scvtf z1.d, p0/m, z6.d
+; CHECK-NEXT:    movprfx z0, z7
+; CHECK-NEXT:    scvtf z0.d, p0/m, z7.d
+; CHECK-NEXT:    stp q3, q0, [x1, #32]
+; CHECK-NEXT:    stp q2, q1, [x1, #96]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: scvtf_v16i16_v16f64:
@@ -2481,38 +2483,38 @@ define void @scvtf_v16i32_v16f64(ptr %a, ptr %b) {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ldp q1, q0, [x0, #32]
 ; CHECK-NEXT:    ptrue p0.d, vl2
-; CHECK-NEXT:    ldp q5, q4, [x0]
+; CHECK-NEXT:    ldp q5, q3, [x0]
 ; CHECK-NEXT:    mov z2.d, z0.d
-; CHECK-NEXT:    mov z3.d, z1.d
-; CHECK-NEXT:    mov z6.d, z4.d
+; CHECK-NEXT:    mov z4.d, z1.d
+; CHECK-NEXT:    mov z6.d, z3.d
 ; CHECK-NEXT:    mov z7.d, z5.d
 ; CHECK-NEXT:    ext z2.b, z2.b, z0.b, #8
-; CHECK-NEXT:    ext z3.b, z3.b, z1.b, #8
+; CHECK-NEXT:    ext z4.b, z4.b, z1.b, #8
 ; CHECK-NEXT:    sunpklo z0.d, z0.s
 ; CHECK-NEXT:    sunpklo z1.d, z1.s
-; CHECK-NEXT:    ext z6.b, z6.b, z4.b, #8
+; CHECK-NEXT:    ext z6.b, z6.b, z3.b, #8
 ; CHECK-NEXT:    ext z7.b, z7.b, z5.b, #8
-; CHECK-NEXT:    sunpklo z4.d, z4.s
+; CHECK-NEXT:    sunpklo z3.d, z3.s
 ; CHECK-NEXT:    sunpklo z5.d, z5.s
 ; CHECK-NEXT:    sunpklo z2.d, z2.s
-; CHECK-NEXT:    sunpklo z3.d, z3.s
+; CHECK-NEXT:    sunpklo z4.d, z4.s
 ; CHECK-NEXT:    scvtf z0.d, p0/m, z0.d
 ; CHECK-NEXT:    sunpklo z6.d, z6.s
 ; CHECK-NEXT:    sunpklo z7.d, z7.s
 ; CHECK-NEXT:    scvtf z1.d, p0/m, z1.d
-; CHECK-NEXT:    scvtf z4.d, p0/m, z4.d
-; CHECK-NEXT:    scvtf z2.d, p0/m, z2.d
 ; CHECK-NEXT:    scvtf z3.d, p0/m, z3.d
-; CHECK-NEXT:    stp q1, q3, [x1, #64]
-; CHECK-NEXT:    movprfx z1, z7
-; CHECK-NEXT:    scvtf z1.d, p0/m, z7.d
+; CHECK-NEXT:    scvtf z2.d, p0/m, z2.d
+; CHECK-NEXT:    scvtf z4.d, p0/m, z4.d
+; CHECK-NEXT:    stp q1, q4, [x1, #64]
+; CHECK-NEXT:    movprfx z1, z5
+; CHECK-NEXT:    scvtf z1.d, p0/m, z5.d
 ; CHECK-NEXT:    stp q0, q2, [x1, #96]
 ; CHECK-NEXT:    movprfx z0, z6
 ; CHECK-NEXT:    scvtf z0.d, p0/m, z6.d
-; CHECK-NEXT:    movprfx z2, z5
-; CHECK-NEXT:    scvtf z2.d, p0/m, z5.d
-; CHECK-NEXT:    stp q2, q1, [x1]
-; CHECK-NEXT:    stp q4, q0, [x1, #32]
+; CHECK-NEXT:    movprfx z2, z7
+; CHECK-NEXT:    scvtf z2.d, p0/m, z7.d
+; CHECK-NEXT:    stp q1, q2, [x1]
+; CHECK-NEXT:    stp q3, q0, [x1, #32]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: scvtf_v16i32_v16f64:
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll
index 41eb731fd66df..39701131d7db6 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll
@@ -288,12 +288,12 @@ define <16 x i8> @select_v16i8(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask)
 define void @select_v32i8(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v32i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.b, vl16
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    cmpeq p1.b, p0/z, z0.b, z1.b
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    cmpeq p1.b, p0/z, z1.b, z0.b
 ; CHECK-NEXT:    cmpeq p0.b, p0/z, z2.b, z3.b
-; CHECK-NEXT:    sel z0.b, p1, z0.b, z1.b
+; CHECK-NEXT:    mov z0.b, p1/m, z1.b
 ; CHECK-NEXT:    sel z1.b, p0, z2.b, z3.b
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
@@ -692,12 +692,12 @@ define <8 x i16> @select_v8i16(<8 x i16> %op1, <8 x i16> %op2, <8 x i1> %mask) {
 define void @select_v16i16(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v16i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.h, vl8
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    cmpeq p1.h, p0/z, z0.h, z1.h
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    cmpeq p1.h, p0/z, z1.h, z0.h
 ; CHECK-NEXT:    cmpeq p0.h, p0/z, z2.h, z3.h
-; CHECK-NEXT:    sel z0.h, p1, z0.h, z1.h
+; CHECK-NEXT:    mov z0.h, p1/m, z1.h
 ; CHECK-NEXT:    sel z1.h, p0, z2.h, z3.h
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
@@ -906,12 +906,12 @@ define <4 x i32> @select_v4i32(<4 x i32> %op1, <4 x i32> %op2, <4 x i1> %mask) {
 define void @select_v8i32(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v8i32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.s, vl4
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    cmpeq p1.s, p0/z, z0.s, z1.s
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    cmpeq p1.s, p0/z, z1.s, z0.s
 ; CHECK-NEXT:    cmpeq p0.s, p0/z, z2.s, z3.s
-; CHECK-NEXT:    sel z0.s, p1, z0.s, z1.s
+; CHECK-NEXT:    mov z0.s, p1/m, z1.s
 ; CHECK-NEXT:    sel z1.s, p0, z2.s, z3.s
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
@@ -930,16 +930,16 @@ define void @select_v8i32(ptr %a, ptr %b) {
 ; NONEON-NOSVE-NEXT:    cmp w10, w9
 ; NONEON-NOSVE-NEXT:    csel w9, w10, w9, eq
 ; NONEON-NOSVE-NEXT:    cmp w13, w12
-; NONEON-NOSVE-NEXT:    ldp w15, w16, [sp, #48]
+; NONEON-NOSVE-NEXT:    ldp w10, w16, [sp, #48]
 ; NONEON-NOSVE-NEXT:    csel w12, w13, w12, eq
 ; NONEON-NOSVE-NEXT:    cmp w14, w11
-; NONEON-NOSVE-NEXT:    ldp w10, w13, [sp, #32]
+; NONEON-NOSVE-NEXT:    ldp w15, w13, [sp, #32]
 ; NONEON-NOSVE-NEXT:    csel w11, w14, w11, eq
 ; NONEON-NOSVE-NEXT:    ldp w17, w14, [sp, #56]
 ; NONEON-NOSVE-NEXT:    ldp w18, w1, [sp, #40]
-; NONEON-NOSVE-NEXT:    cmp w10, w15
+; NONEON-NOSVE-NEXT:    cmp w15, w10
 ; NONEON-NOSVE-NEXT:    stp w12, w11, [sp, #72]
-; NONEON-NOSVE-NEXT:    csel w10, w10, w15, eq
+; NONEON-NOSVE-NEXT:    csel w10, w15, w10, eq
 ; NONEON-NOSVE-NEXT:    cmp w13, w16
 ; NONEON-NOSVE-NEXT:    ldr w15, [sp]
 ; NONEON-NOSVE-NEXT:    csel w13, w13, w16, eq
@@ -1039,12 +1039,12 @@ define <2 x i64> @select_v2i64(<2 x i64> %op1, <2 x i64> %op2, <2 x i1> %mask) {
 define void @select_v4i64(ptr %a, ptr %b) {
 ; CHECK-LABEL: select_v4i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.d, vl2
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    cmpeq p1.d, p0/z, z0.d, z1.d
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    cmpeq p1.d, p0/z, z1.d, z0.d
 ; CHECK-NEXT:    cmpeq p0.d, p0/z, z2.d, z3.d
-; CHECK-NEXT:    sel z0.d, p1, z0.d, z1.d
+; CHECK-NEXT:    mov z0.d, p1/m, z1.d
 ; CHECK-NEXT:    sel z1.d, p0, z2.d, z3.d
 ; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ret
@@ -1057,13 +1057,13 @@ define void @select_v4i64(ptr %a, ptr %b) {
 ; NONEON-NOSVE-NEXT:    .cfi_def_cfa_offset 96
 ; NONEON-NOSVE-NEXT:    stp q2, q3, [sp, #32]
 ; NONEON-NOSVE-NEXT:    ldr x9, [sp, #8]
-; NONEON-NOSVE-NEXT:    ldp x8, x11, [sp, #24]
+; NONEON-NOSVE-NEXT:    ldp x8, x10, [sp, #24]
 ; NONEON-NOSVE-NEXT:    ldr x13, [sp, #40]
-; NONEON-NOSVE-NEXT:    ldp x10, x12, [sp, #48]
+; NONEON-NOSVE-NEXT:    ldp x11, x12, [sp, #48]
 ; NONEON-NOSVE-NEXT:    cmp x9, x8
 ; NONEON-NOSVE-NEXT:    csel x8, x9, x8, eq
-; NONEON-NOSVE-NEXT:    cmp x11, x10
-; NONEON-NOSVE-NEXT:    csel x9, x11, x10, eq
+; NONEON-NOSVE-NEXT:    cmp x10, x11
+; NONEON-NOSVE-NEXT:    csel x9, x10, x11, eq
 ; NONEON-NOSVE-NEXT:    ldr x10, [sp, #16]
 ; NONEON-NOSVE-NEXT:    ldr x11, [sp]
 ; NONEON-NOSVE-NEXT:    cmp x13, x12
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll
index 3d9f407c3064c..e0e88c47fb55c 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll
@@ -151,20 +151,20 @@ define void @zip_v32i16(ptr %a, ptr %b) {
 ; CHECK-NEXT:    .cfi_offset b13, -48
 ; CHECK-NEXT:    .cfi_offset b14, -56
 ; CHECK-NEXT:    .cfi_offset b15, -64
-; CHECK-NEXT:    ldp q1, q0, [x0]
-; CHECK-NEXT:    ldp q2, q3, [x1]
-; CHECK-NEXT:    mov z5.h, z0.h[7]
-; CHECK-NEXT:    mov z7.h, z0.h[6]
-; CHECK-NEXT:    mov z17.h, z0.h[5]
-; CHECK-NEXT:    mov z4.h, z3.h[7]
-; CHECK-NEXT:    mov z6.h, z3.h[6]
-; CHECK-NEXT:    mov z16.h, z3.h[5]
-; CHECK-NEXT:    mov z18.h, z3.h[4]
-; CHECK-NEXT:    mov z19.h, z0.h[4]
-; CHECK-NEXT:    mov z20.h, z2.h[7]
-; CHECK-NEXT:    mov z21.h, z1.h[7]
-; CHECK-NEXT:    mov z22.h, z2.h[6]
-; CHECK-NEXT:    mov z23.h, z1.h[6]
+; CHECK-NEXT:    ldp q1, q2, [x1]
+; CHECK-NEXT:    ldp q0, q3, [x0]
+; CHECK-NEXT:    mov z4.h, z2.h[7]
+; CHECK-NEXT:    mov z6.h, z2.h[6]
+; CHECK-NEXT:    mov z16.h, z2.h[5]
+; CHECK-NEXT:    mov z5.h, z3.h[7]
+; CHECK-NEXT:    mov z7.h, z3.h[6]
+; CHECK-NEXT:    mov z17.h, z3.h[5]
+; CHECK-NEXT:    mov z18.h, z2.h[4]
+; CHECK-NEXT:    mov z19.h, z3.h[4]
+; CHECK-NEXT:    mov z20.h, z1.h[7]
+; CHECK-NEXT:    mov z21.h, z0.h[7]
+; CHECK-NEXT:    mov z22.h, z1.h[6]
+; CHECK-NEXT:    mov z23.h, z0.h[6]
 ; CHECK-NEXT:    zip1 z24.h, z5.h, z4.h
 ; CHECK-NEXT:    zip1 z25.h, z7.h, z6.h
 ; CHECK-NEXT:    zip1 z16.h, z17.h, z16.h
@@ -174,12 +174,12 @@ define void @zip_v32i16(ptr %a, ptr %b) {
 ; CHECK-NEXT:    zip1 z18.h, z21.h, z20.h
 ; CHECK-NEXT:    zip1 z21.s, z25.s, z24.s
 ; CHECK-NEXT:    zip1 z22.h, z23.h, z22.h
-; CHECK-NEXT:    mov z23.h, z2.h[5]
+; CHECK-NEXT:    mov z23.h, z1.h[5]
 ; CHECK-NEXT:    mov z20.h, z6.h[7]
-; CHECK-NEXT:    mov z24.h, z1.h[5]
-; CHECK-NEXT:    mov z25.h, z2.h[4]
+; CHECK-NEXT:    mov z24.h, z0.h[5]
+; CHECK-NEXT:    mov z25.h, z1.h[4]
 ; CHECK-NEXT:    mov z19.h, z7.h[7]
-; CHECK-NEXT:    mov z26.h, z1.h[4]
+; CHECK-NEXT:    mov z26.h, z0.h[4]
 ; CHECK-NEXT:    mov z27.h, z6.h[6]
 ; CHECK-NEXT:    mov z28.h, z7.h[5]
 ; CHECK-NEXT:    mov z29.h, z6.h[5]
@@ -212,22 +212,22 @@ define void @zip_v32i16(ptr %a, ptr %b) {
 ; CHECK-NEXT:    zip1 z19.s, z28.s, z27.s
 ; CHECK-NEXT:    zip1 z18.s, z22.s, z18.s
 ; CHECK-NEXT:    zip1 z20.s, z24.s, z23.s
-; CHECK-NEXT:    zip1 z0.h, z0.h, z3.h
+; CHECK-NEXT:    zip1 z2.h, z3.h, z2.h
 ; CHECK-NEXT:    zip1 z3.s, z26.s, z25.s
 ; CHECK-NEXT:    zip1 z22.s, z30.s, z29.s
 ; CHECK-NEXT:    zip1 z6.h, z6.h, z7.h
 ; CHECK-NEXT:    zip1 z7.d, z16.d, z21.d
 ; CHECK-NEXT:    zip1 z16.d, z19.d, z17.d
-; CHECK-NEXT:    zip1 z1.h, z1.h, z2.h
-; CHECK-NEXT:    zip1 z2.h, z4.h, z5.h
+; CHECK-NEXT:    zip1 z0.h, z0.h, z1.h
+; CHECK-NEXT:    zip1 z1.h, z4.h, z5.h
 ; CHECK-NEXT:    zip1 z4.d, z20.d, z18.d
 ; CHECK-NEXT:    zip1 z3.d, z22.d, z3.d
-; CHECK-NEXT:    add z0.h, z0.h, z6.h
+; CHECK-NEXT:    add z2.h, z2.h, z6.h
 ; CHECK-NEXT:    add z5.h, z7.h, z16.h
-; CHECK-NEXT:    add z1.h, z1.h, z2.h
-; CHECK-NEXT:    add z2.h, z4.h, z3.h
-; CHECK-NEXT:    stp q0, q5, [x0, #32]
-; CHECK-NEXT:    stp q1, q2, [x0]
+; CHECK-NEXT:    add z0.h, z0.h, z1.h
+; CHECK-NEXT:    add z1.h, z4.h, z3.h
+; CHECK-NEXT:    stp q2, q5, [x0, #32]
+; CHECK-NEXT:    stp q0, q1, [x0]
 ; CHECK-NEXT:    ldp d15, d14, [sp], #64 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 ;
@@ -659,10 +659,10 @@ define void @zip1_v8i32_undef(ptr %a) {
 define void @trn_v32i8(ptr %a, ptr %b) {
 ; CHECK-LABEL: trn_v32i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    trn1 z4.b, z0.b, z1.b
-; CHECK-NEXT:    trn2 z0.b, z0.b, z1.b
+; CHECK-NEXT:    ldp q0, q3, [x1]
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    trn1 z4.b, z1.b, z0.b
+; CHECK-NEXT:    trn2 z0.b, z1.b, z0.b
 ; CHECK-NEXT:    trn1 z1.b, z2.b, z3.b
 ; CHECK-NEXT:    trn2 z2.b, z2.b, z3.b
 ; CHECK-NEXT:    add z0.b, z4.b, z0.b
@@ -862,10 +862,10 @@ define void @trn_v8i16(ptr %a, ptr %b) {
 define void @trn_v16i16(ptr %a, ptr %b) {
 ; CHECK-LABEL: trn_v16i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    trn1 z4.h, z0.h, z1.h
-; CHECK-NEXT:    trn2 z0.h, z0.h, z1.h
+; CHECK-NEXT:    ldp q0, q3, [x1]
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    trn1 z4.h, z1.h, z0.h
+; CHECK-NEXT:    trn2 z0.h, z1.h, z0.h
 ; CHECK-NEXT:    trn1 z1.h, z2.h, z3.h
 ; CHECK-NEXT:    trn2 z2.h, z2.h, z3.h
 ; CHECK-NEXT:    add z0.h, z4.h, z0.h
@@ -961,10 +961,10 @@ define void @trn_v16i16(ptr %a, ptr %b) {
 define void @trn_v8i32(ptr %a, ptr %b) {
 ; CHECK-LABEL: trn_v8i32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    zip1 z4.s, z0.s, z1.s
-; CHECK-NEXT:    trn2 z0.s, z0.s, z1.s
+; CHECK-NEXT:    ldp q0, q3, [x1]
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    zip1 z4.s, z1.s, z0.s
+; CHECK-NEXT:    trn2 z0.s, z1.s, z0.s
 ; CHECK-NEXT:    trn1 z1.s, z2.s, z3.s
 ; CHECK-NEXT:    trn2 z2.s, z2.s, z3.s
 ; CHECK-NEXT:    add z0.s, z4.s, z0.s
@@ -1006,11 +1006,11 @@ define void @trn_v8i32(ptr %a, ptr %b) {
 define void @trn_v4f64(ptr %a, ptr %b) {
 ; CHECK-LABEL: trn_v4f64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q0, q2, [x0]
+; CHECK-NEXT:    ldp q0, q3, [x1]
 ; CHECK-NEXT:    ptrue p0.d, vl2
-; CHECK-NEXT:    ldp q1, q3, [x1]
-; CHECK-NEXT:    zip1 z4.d, z0.d, z1.d
-; CHECK-NEXT:    trn2 z0.d, z0.d, z1.d
+; CHECK-NEXT:    ldp q1, q2, [x0]
+; CHECK-NEXT:    zip1 z4.d, z1.d, z0.d
+; CHECK-NEXT:    trn2 z0.d, z1.d, z0.d
 ; CHECK-NEXT:    zip1 z1.d, z2.d, z3.d
 ; CHECK-NEXT:    trn2 z2.d, z2.d, z3.d
 ; CHECK-NEXT:    fadd z0.d, p0/m, z0.d, z4.d
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll
index e07036f2a1acf..90466e3cebd5e 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll
@@ -125,14 +125,14 @@ define i1 @ptest_or_v16i1(ptr %a, ptr %b) {
 ; CHECK-NEXT:    ldp q2, q3, [x1, #32]
 ; CHECK-NEXT:    ldp q4, q5, [x0]
 ; CHECK-NEXT:    fcmne p1.s, p0/z, z1.s, #0.0
-; CHECK-NEXT:    ldp q1, q6, [x1]
+; CHECK-NEXT:    ldp q6, q1, [x1]
 ; CHECK-NEXT:    fcmne p3.s, p0/z, z3.s, #0.0
 ; CHECK-NEXT:    fcmne p2.s, p0/z, z0.s, #0.0
 ; CHECK-NEXT:    fcmne p5.s, p0/z, z2.s, #0.0
 ; CHECK-NEXT:    fcmne p4.s, p0/z, z5.s, #0.0
 ; CHECK-NEXT:    fcmne p7.s, p0/z, z4.s, #0.0
-; CHECK-NEXT:    fcmne p6.s, p0/z, z6.s, #0.0
-; CHECK-NEXT:    fcmne p0.s, p0/z, z1.s, #0.0
+; CHECK-NEXT:    fcmne p6.s, p0/z, z1.s, #0.0
+; CHECK-NEXT:    fcmne p0.s, p0/z, z6.s, #0.0
 ; CHECK-NEXT:    mov z0.s, p1/z, #-1 // =0xffffffffffffffff
 ; CHECK-NEXT:    mov z2.s, p3/z, #-1 // =0xffffffffffffffff
 ; CHECK-NEXT:    mov z1.s, p2/z, #-1 // =0xffffffffffffffff
@@ -334,14 +334,14 @@ define i1 @ptest_and_v16i1(ptr %a, ptr %b) {
 ; CHECK-NEXT:    ldp q2, q3, [x1, #32]
 ; CHECK-NEXT:    ldp q4, q5, [x0]
 ; CHECK-NEXT:    fcmne p1.s, p0/z, z1.s, #0.0
-; CHECK-NEXT:    ldp q1, q6, [x1]
+; CHECK-NEXT:    ldp q6, q1, [x1]
 ; CHECK-NEXT:    fcmne p3.s, p0/z, z3.s, #0.0
 ; CHECK-NEXT:    fcmne p2.s, p0/z, z0.s, #0.0
 ; CHECK-NEXT:    fcmne p5.s, p0/z, z2.s, #0.0
 ; CHECK-NEXT:    fcmne p4.s, p0/z, z5.s, #0.0
 ; CHECK-NEXT:    fcmne p7.s, p0/z, z4.s, #0.0
-; CHECK-NEXT:    fcmne p6.s, p0/z, z6.s, #0.0
-; CHECK-NEXT:    fcmne p0.s, p0/z, z1.s, #0.0
+; CHECK-NEXT:    fcmne p6.s, p0/z, z1.s, #0.0
+; CHECK-NEXT:    fcmne p0.s, p0/z, z6.s, #0.0
 ; CHECK-NEXT:    mov z0.s, p1/z, #-1 // =0xffffffffffffffff
 ; CHECK-NEXT:    mov z2.s, p3/z, #-1 // =0xffffffffffffffff
 ; CHECK-NEXT:    mov z1.s, p2/z, #-1 // =0xffffffffffffffff
diff --git a/llvm/test/CodeGen/AArch64/vec_uaddo.ll b/llvm/test/CodeGen/AArch64/vec_uaddo.ll
index b29195eed9149..2f51208e49351 100644
--- a/llvm/test/CodeGen/AArch64/vec_uaddo.ll
+++ b/llvm/test/CodeGen/AArch64/vec_uaddo.ll
@@ -278,8 +278,8 @@ define <2 x i32> @uaddo_v2i128(<2 x i128> %a0, <2 x i128> %a1, ptr %p2) nounwind
 ; CHECK-NEXT:    fmov s0, w13
 ; CHECK-NEXT:    mov v0.s[1], w10
 ; CHECK-NEXT:    ldr x10, [sp]
-; CHECK-NEXT:    stp x8, x9, [x10, #16]
 ; CHECK-NEXT:    stp x11, x12, [x10]
+; CHECK-NEXT:    stp x8, x9, [x10, #16]
 ; CHECK-NEXT:    shl v0.2s, v0.2s, #31
 ; CHECK-NEXT:    cmlt v0.2s, v0.2s, #0
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/vec_umulo.ll b/llvm/test/CodeGen/AArch64/vec_umulo.ll
index 12ea8862a03cd..935f4272218af 100644
--- a/llvm/test/CodeGen/AArch64/vec_umulo.ll
+++ b/llvm/test/CodeGen/AArch64/vec_umulo.ll
@@ -340,12 +340,12 @@ define <2 x i32> @umulo_v2i128(<2 x i128> %a0, <2 x i128> %a1, ptr %p2) nounwind
 ; CHECK-NEXT:    csinc w11, w12, wzr, lo
 ; CHECK-NEXT:    ldr x12, [sp]
 ; CHECK-NEXT:    fmov s0, w11
-; CHECK-NEXT:    mul x11, x0, x4
+; CHECK-NEXT:    mul x11, x2, x6
 ; CHECK-NEXT:    mov v0.s[1], w8
-; CHECK-NEXT:    mul x8, x2, x6
-; CHECK-NEXT:    stp x11, x10, [x12]
+; CHECK-NEXT:    mul x8, x0, x4
+; CHECK-NEXT:    stp x11, x9, [x12, #16]
 ; CHECK-NEXT:    shl v0.2s, v0.2s, #31
-; CHECK-NEXT:    stp x8, x9, [x12, #16]
+; CHECK-NEXT:    stp x8, x10, [x12]
 ; CHECK-NEXT:    cmlt v0.2s, v0.2s, #0
 ; CHECK-NEXT:    ret
   %t = call {<2 x i128>, <2 x i1>} @llvm.umul.with.overflow.v2i128(<2 x i128> %a0, <2 x i128> %a1)
diff --git a/llvm/test/CodeGen/AArch64/vselect-ext.ll b/llvm/test/CodeGen/AArch64/vselect-ext.ll
index 0b90343a40c83..76b7f3d9dfc0e 100644
--- a/llvm/test/CodeGen/AArch64/vselect-ext.ll
+++ b/llvm/test/CodeGen/AArch64/vselect-ext.ll
@@ -334,7 +334,7 @@ define <16 x i32> @same_zext_used_in_cmp_unsigned_pred_and_select_other_use(<16
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    movi.16b v16, #10
 ; CHECK-NEXT:    ushll.8h v19, v0, #0
-; CHECK-NEXT:    ldr q21, [sp]
+; CHECK-NEXT:    ldr q22, [sp]
 ; CHECK-NEXT:    ushll.4s v24, v19, #0
 ; CHECK-NEXT:    ushll2.4s v19, v19, #0
 ; CHECK-NEXT:    cmhi.16b v16, v0, v16
@@ -345,33 +345,33 @@ define <16 x i32> @same_zext_used_in_cmp_unsigned_pred_and_select_other_use(<16
 ; CHECK-NEXT:    ushll2.4s v0, v0, #0
 ; CHECK-NEXT:    sshll2.4s v18, v17, #0
 ; CHECK-NEXT:    sshll.4s v17, v17, #0
-; CHECK-NEXT:    sshll2.4s v22, v16, #0
+; CHECK-NEXT:    sshll2.4s v20, v16, #0
 ; CHECK-NEXT:    sshll.4s v16, v16, #0
-; CHECK-NEXT:    sshll2.2d v20, v18, #0
+; CHECK-NEXT:    sshll2.2d v21, v18, #0
 ; CHECK-NEXT:    sshll.2d v23, v18, #0
 ; CHECK-NEXT:    sshll2.2d v26, v17, #0
-; CHECK-NEXT:    sshll.2d v27, v17, #0
-; CHECK-NEXT:    and.16b v20, v21, v20
-; CHECK-NEXT:    sshll2.2d v21, v22, #0
+; CHECK-NEXT:    sshll2.2d v27, v20, #0
+; CHECK-NEXT:    and.16b v21, v22, v21
+; CHECK-NEXT:    sshll.2d v22, v17, #0
 ; CHECK-NEXT:    and.16b v7, v7, v23
-; CHECK-NEXT:    sshll.2d v23, v22, #0
+; CHECK-NEXT:    sshll.2d v23, v20, #0
 ; CHECK-NEXT:    and.16b v6, v6, v26
 ; CHECK-NEXT:    sshll2.2d v26, v16, #0
-; CHECK-NEXT:    and.16b v5, v5, v27
-; CHECK-NEXT:    stp q7, q20, [x0, #96]
-; CHECK-NEXT:    sshll.2d v20, v16, #0
-; CHECK-NEXT:    and.16b v21, v4, v21
+; CHECK-NEXT:    and.16b v27, v4, v27
 ; CHECK-NEXT:    and.16b v4, v0, v18
+; CHECK-NEXT:    and.16b v0, v24, v16
+; CHECK-NEXT:    stp q7, q21, [x0, #96]
+; CHECK-NEXT:    sshll.2d v21, v16, #0
+; CHECK-NEXT:    and.16b v5, v5, v22
 ; CHECK-NEXT:    and.16b v7, v3, v23
-; CHECK-NEXT:    and.16b v3, v19, v22
+; CHECK-NEXT:    and.16b v3, v19, v20
 ; CHECK-NEXT:    stp q5, q6, [x0, #64]
-; CHECK-NEXT:    and.16b v0, v24, v16
 ; CHECK-NEXT:    and.16b v6, v2, v26
 ; CHECK-NEXT:    and.16b v2, v25, v17
-; CHECK-NEXT:    and.16b v5, v1, v20
+; CHECK-NEXT:    and.16b v5, v1, v21
 ; CHECK-NEXT:    mov.16b v1, v3
 ; CHECK-NEXT:    mov.16b v3, v4
-; CHECK-NEXT:    stp q7, q21, [x0, #32]
+; CHECK-NEXT:    stp q7, q27, [x0, #32]
 ; CHECK-NEXT:    stp q5, q6, [x0]
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll b/llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll
index 531e0fa740da7..92fd4fe30980c 100644
--- a/llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll
+++ b/llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll
@@ -168,14 +168,13 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; ALL-NEXT:    stp q0, q0, [sp, #32]
 ; ALL-NEXT:    eor x12, x12, #0x3f
 ; ALL-NEXT:    add x8, x9, x8
-; ALL-NEXT:    ldp x13, x11, [x8]
-; ALL-NEXT:    ldr x9, [x8, #24]
-; ALL-NEXT:    ldr x8, [x8, #16]
-; ALL-NEXT:    lsl x14, x9, #1
+; ALL-NEXT:    ldp x13, x9, [x8]
+; ALL-NEXT:    ldp x8, x11, [x8, #16]
+; ALL-NEXT:    lsl x15, x9, #1
 ; ALL-NEXT:    lsr x9, x9, x10
-; ALL-NEXT:    lsl x15, x11, #1
-; ALL-NEXT:    lsr x11, x11, x10
 ; ALL-NEXT:    lsr x13, x13, x10
+; ALL-NEXT:    lsl x14, x11, #1
+; ALL-NEXT:    lsr x11, x11, x10
 ; ALL-NEXT:    lsl x14, x14, x12
 ; ALL-NEXT:    lsl x12, x15, x12
 ; ALL-NEXT:    lsl x15, x8, #1
@@ -183,10 +182,10 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; ALL-NEXT:    mvn w10, w10
 ; ALL-NEXT:    lsl x10, x15, x10
 ; ALL-NEXT:    orr x8, x14, x8
-; ALL-NEXT:    stp x8, x9, [x2, #16]
-; ALL-NEXT:    orr x9, x12, x13
-; ALL-NEXT:    orr x8, x11, x10
-; ALL-NEXT:    stp x9, x8, [x2]
+; ALL-NEXT:    stp x8, x11, [x2, #16]
+; ALL-NEXT:    orr x11, x12, x13
+; ALL-NEXT:    orr x8, x9, x10
+; ALL-NEXT:    stp x11, x8, [x2]
 ; ALL-NEXT:    add sp, sp, #64
 ; ALL-NEXT:    ret
   %src = load i256, ptr %src.ptr, align 1
@@ -213,14 +212,13 @@ define void @shl_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; ALL-NEXT:    str q0, [sp]
 ; ALL-NEXT:    eor x12, x12, #0x3f
 ; ALL-NEXT:    sub x8, x9, x8
-; ALL-NEXT:    ldp x11, x13, [x8, #16]
-; ALL-NEXT:    ldr x9, [x8]
-; ALL-NEXT:    ldr x8, [x8, #8]
-; ALL-NEXT:    lsr x15, x9, #1
+; ALL-NEXT:    ldp x9, x13, [x8, #16]
+; ALL-NEXT:    ldp x11, x8, [x8]
+; ALL-NEXT:    lsr x14, x9, #1
 ; ALL-NEXT:    lsl x9, x9, x10
-; ALL-NEXT:    lsr x14, x11, #1
-; ALL-NEXT:    lsl x11, x11, x10
 ; ALL-NEXT:    lsl x13, x13, x10
+; ALL-NEXT:    lsr x15, x11, #1
+; ALL-NEXT:    lsl x11, x11, x10
 ; ALL-NEXT:    lsr x14, x14, x12
 ; ALL-NEXT:    lsr x12, x15, x12
 ; ALL-NEXT:    lsr x15, x8, #1
@@ -228,10 +226,10 @@ define void @shl_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; ALL-NEXT:    mvn w10, w10
 ; ALL-NEXT:    lsr x10, x15, x10
 ; ALL-NEXT:    orr x8, x8, x12
-; ALL-NEXT:    stp x9, x8, [x2]
-; ALL-NEXT:    orr x9, x13, x14
-; ALL-NEXT:    orr x8, x11, x10
-; ALL-NEXT:    stp x8, x9, [x2, #16]
+; ALL-NEXT:    stp x11, x8, [x2]
+; ALL-NEXT:    orr x11, x13, x14
+; ALL-NEXT:    orr x8, x9, x10
+; ALL-NEXT:    stp x8, x11, [x2, #16]
 ; ALL-NEXT:    add sp, sp, #64
 ; ALL-NEXT:    ret
   %src = load i256, ptr %src.ptr, align 1
@@ -258,14 +256,13 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; ALL-NEXT:    eor x12, x12, #0x3f
 ; ALL-NEXT:    stp x8, x8, [sp, #32]
 ; ALL-NEXT:    add x8, x11, x9
-; ALL-NEXT:    ldp x13, x11, [x8]
-; ALL-NEXT:    ldr x9, [x8, #24]
-; ALL-NEXT:    ldr x8, [x8, #16]
-; ALL-NEXT:    lsl x14, x9, #1
-; ALL-NEXT:    asr x9, x9, x10
-; ALL-NEXT:    lsl x15, x11, #1
-; ALL-NEXT:    lsr x11, x11, x10
+; ALL-NEXT:    ldp x13, x9, [x8]
+; ALL-NEXT:    ldp x8, x11, [x8, #16]
+; ALL-NEXT:    lsl x15, x9, #1
+; ALL-NEXT:    lsr x9, x9, x10
 ; ALL-NEXT:    lsr x13, x13, x10
+; ALL-NEXT:    lsl x14, x11, #1
+; ALL-NEXT:    asr x11, x11, x10
 ; ALL-NEXT:    lsl x14, x14, x12
 ; ALL-NEXT:    lsl x12, x15, x12
 ; ALL-NEXT:    lsl x15, x8, #1
@@ -273,10 +270,10 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; ALL-NEXT:    mvn w10, w10
 ; ALL-NEXT:    lsl x10, x15, x10
 ; ALL-NEXT:    orr x8, x14, x8
-; ALL-NEXT:    stp x8, x9, [x2, #16]
-; ALL-NEXT:    orr x9, x12, x13
-; ALL-NEXT:    orr x8, x11, x10
-; ALL-NEXT:    stp x9, x8, [x2]
+; ALL-NEXT:    stp x8, x11, [x2, #16]
+; ALL-NEXT:    orr x11, x12, x13
+; ALL-NEXT:    orr x8, x9, x10
+; ALL-NEXT:    stp x11, x8, [x2]
 ; ALL-NEXT:    add sp, sp, #64
 ; ALL-NEXT:    ret
   %src = load i256, ptr %src.ptr, align 1
diff --git a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
index eb83aa5a13e52..75c5bee2ae0ab 100644
--- a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
+++ b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
@@ -1486,17 +1486,17 @@ define void @zext_v16i32_to_v16i64_in_loop(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    cmp x8, #512
 ; CHECK-NEXT:    ldp q5, q4, [x9]
 ; CHECK-NEXT:    ushll2.2d v2, v0, #0
-; CHECK-NEXT:    ushll.2d v0, v0, #0
 ; CHECK-NEXT:    ushll2.2d v3, v1, #0
+; CHECK-NEXT:    ushll.2d v0, v0, #0
 ; CHECK-NEXT:    ushll.2d v1, v1, #0
 ; CHECK-NEXT:    stp q0, q2, [x1, #96]
 ; CHECK-NEXT:    ushll2.2d v2, v4, #0
-; CHECK-NEXT:    ushll.2d v0, v4, #0
+; CHECK-NEXT:    ushll2.2d v0, v5, #0
 ; CHECK-NEXT:    stp q1, q3, [x1, #64]
-; CHECK-NEXT:    ushll2.2d v3, v5, #0
+; CHECK-NEXT:    ushll.2d v3, v4, #0
 ; CHECK-NEXT:    ushll.2d v1, v5, #0
-; CHECK-NEXT:    stp q0, q2, [x1, #32]
-; CHECK-NEXT:    stp q1, q3, [x1], #128
+; CHECK-NEXT:    stp q3, q2, [x1, #32]
+; CHECK-NEXT:    stp q1, q0, [x1], #128
 ; CHECK-NEXT:    b.ne LBB15_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
 ; CHECK-NEXT:    ret
@@ -1683,26 +1683,26 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldp d2, d3, [x9, #-8]
 ; CHECK-NEXT:    subs x10, x10, #16
-; CHECK-NEXT:    ldp q6, q5, [x8, #-32]
+; CHECK-NEXT:    ldp q7, q5, [x8, #-32]
 ; CHECK-NEXT:    add x9, x9, #16
-; CHECK-NEXT:    ldp q17, q16, [x8, #-64]
+; CHECK-NEXT:    ldp q17, q6, [x8, #-64]
 ; CHECK-NEXT:    tbl.16b v4, { v2 }, v1
 ; CHECK-NEXT:    tbl.16b v2, { v2 }, v0
-; CHECK-NEXT:    tbl.16b v7, { v3 }, v1
+; CHECK-NEXT:    tbl.16b v16, { v3 }, v1
 ; CHECK-NEXT:    tbl.16b v3, { v3 }, v0
 ; CHECK-NEXT:    uaddw2.2d v5, v5, v4
-; CHECK-NEXT:    uaddw.2d v4, v6, v4
-; CHECK-NEXT:    uaddw2.2d v6, v16, v2
-; CHECK-NEXT:    ldp q18, q16, [x8, #32]
+; CHECK-NEXT:    uaddw2.2d v6, v6, v2
+; CHECK-NEXT:    uaddw.2d v4, v7, v4
+; CHECK-NEXT:    ldp q18, q7, [x8, #32]
 ; CHECK-NEXT:    uaddw.2d v2, v17, v2
 ; CHECK-NEXT:    stp q4, q5, [x8, #-32]
-; CHECK-NEXT:    uaddw2.2d v5, v16, v7
-; CHECK-NEXT:    ldp q16, q4, [x8]
-; CHECK-NEXT:    uaddw.2d v7, v18, v7
+; CHECK-NEXT:    uaddw2.2d v5, v7, v16
 ; CHECK-NEXT:    stp q2, q6, [x8, #-64]
-; CHECK-NEXT:    uaddw2.2d v4, v4, v3
-; CHECK-NEXT:    uaddw.2d v2, v16, v3
-; CHECK-NEXT:    stp q7, q5, [x8, #32]
+; CHECK-NEXT:    uaddw.2d v16, v18, v16
+; CHECK-NEXT:    ldp q7, q6, [x8]
+; CHECK-NEXT:    stp q16, q5, [x8, #32]
+; CHECK-NEXT:    uaddw2.2d v4, v6, v3
+; CHECK-NEXT:    uaddw.2d v2, v7, v3
 ; CHECK-NEXT:    stp q2, q4, [x8], #128
 ; CHECK-NEXT:    b.ne LBB17_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
@@ -1826,34 +1826,34 @@ define void @zext_v16i8_to_v16i64_in_sequence_in_loop(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    ushll.8h v1, v1, #0
 ; CHECK-NEXT:    ushll2.4s v3, v2, #0
 ; CHECK-NEXT:    ushll.4s v2, v2, #0
-; CHECK-NEXT:    ushll2.4s v5, v0, #0
+; CHECK-NEXT:    ushll2.4s v4, v0, #0
 ; CHECK-NEXT:    ushll.4s v0, v0, #0
-; CHECK-NEXT:    ushll2.2d v4, v3, #0
+; CHECK-NEXT:    ushll2.2d v5, v3, #0
 ; CHECK-NEXT:    ushll.2d v3, v3, #0
 ; CHECK-NEXT:    ushll2.2d v7, v2, #0
+; CHECK-NEXT:    ushll2.2d v16, v4, #0
 ; CHECK-NEXT:    ushll.2d v2, v2, #0
-; CHECK-NEXT:    stp q3, q4, [x9, #-32]
-; CHECK-NEXT:    ushll2.2d v4, v5, #0
+; CHECK-NEXT:    ushll.2d v4, v4, #0
+; CHECK-NEXT:    stp q3, q5, [x9, #-32]
 ; CHECK-NEXT:    ushll2.4s v3, v6, #0
-; CHECK-NEXT:    ushll.2d v5, v5, #0
-; CHECK-NEXT:    stp q2, q7, [x9, #-64]
-; CHECK-NEXT:    ushll2.2d v7, v0, #0
+; CHECK-NEXT:    ushll2.2d v5, v0, #0
 ; CHECK-NEXT:    ushll.2d v0, v0, #0
-; CHECK-NEXT:    ushll.4s v2, v6, #0
-; CHECK-NEXT:    stp q5, q4, [x9, #-96]
-; CHECK-NEXT:    ushll2.2d v4, v3, #0
-; CHECK-NEXT:    ushll2.4s v5, v1, #0
+; CHECK-NEXT:    stp q4, q16, [x9, #-96]
+; CHECK-NEXT:    ushll.4s v6, v6, #0
+; CHECK-NEXT:    stp q2, q7, [x9, #-64]
+; CHECK-NEXT:    ushll2.4s v4, v1, #0
+; CHECK-NEXT:    ushll2.2d v2, v3, #0
 ; CHECK-NEXT:    ushll.2d v3, v3, #0
-; CHECK-NEXT:    stp q0, q7, [x9, #-128]
+; CHECK-NEXT:    stp q0, q5, [x9, #-128]
 ; CHECK-NEXT:    ushll.4s v0, v1, #0
-; CHECK-NEXT:    ushll2.2d v6, v2, #0
-; CHECK-NEXT:    ushll.2d v1, v2, #0
-; CHECK-NEXT:    ushll2.2d v2, v5, #0
-; CHECK-NEXT:    stp q3, q4, [x9, #96]
-; CHECK-NEXT:    ushll.2d v3, v5, #0
+; CHECK-NEXT:    ushll2.2d v5, v6, #0
+; CHECK-NEXT:    ushll.2d v1, v6, #0
+; CHECK-NEXT:    stp q3, q2, [x9, #96]
+; CHECK-NEXT:    ushll2.2d v2, v4, #0
+; CHECK-NEXT:    ushll.2d v3, v4, #0
 ; CHECK-NEXT:    ushll2.2d v4, v0, #0
 ; CHECK-NEXT:    ushll.2d v0, v0, #0
-; CHECK-NEXT:    stp q1, q6, [x9, #64]
+; CHECK-NEXT:    stp q1, q5, [x9, #64]
 ; CHECK-NEXT:    stp q3, q2, [x9, #32]
 ; CHECK-NEXT:    stp q0, q4, [x9], #128
 ; CHECK-NEXT:    b.ne LBB18_1
@@ -2678,9 +2678,9 @@ define void @zext_v8i8_to_v8i33_in_loop(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    orr x10, x10, x12, lsl #4
 ; CHECK-NEXT:    fmov x12, d3
 ; CHECK-NEXT:    stp x10, x9, [x1, #16]
+; CHECK-NEXT:    fmov x9, d0
 ; CHECK-NEXT:    orr x11, x11, x12, lsl #2
-; CHECK-NEXT:    fmov x12, d0
-; CHECK-NEXT:    orr x9, x12, x13, lsl #33
+; CHECK-NEXT:    orr x9, x9, x13, lsl #33
 ; CHECK-NEXT:    stp x9, x11, [x1], #128
 ; CHECK-NEXT:    b.ne LBB22_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
@@ -2913,28 +2913,29 @@ define i32 @test_widening_instr_mull_64(ptr %p1, ptr %p2, i32 %h) {
 ; CHECK-NEXT:  LBB25_1: ; %loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldr q4, [x0]
-; CHECK-NEXT:    ldp q16, q7, [x1, #32]
-; CHECK-NEXT:    ldr q18, [x8, #16]!
+; CHECK-NEXT:    ldp q17, q16, [x1, #32]
+; CHECK-NEXT:    ldr q18, [x1]
 ; CHECK-NEXT:    subs w2, w2, #1
 ; CHECK-NEXT:    tbl.16b v5, { v4 }, v3
 ; CHECK-NEXT:    tbl.16b v6, { v4 }, v0
-; CHECK-NEXT:    tbl.16b v17, { v4 }, v2
-; CHECK-NEXT:    tbl.16b v4, { v4 }, v1
-; CHECK-NEXT:    umull2.2d v19, v5, v7
-; CHECK-NEXT:    umull.2d v5, v5, v7
-; CHECK-NEXT:    ldr q7, [x1]
-; CHECK-NEXT:    umull2.2d v20, v6, v16
-; CHECK-NEXT:    umull2.2d v21, v17, v18
-; CHECK-NEXT:    umull.2d v17, v17, v18
-; CHECK-NEXT:    umull2.2d v18, v4, v7
-; CHECK-NEXT:    umull.2d v4, v4, v7
+; CHECK-NEXT:    tbl.16b v7, { v4 }, v1
+; CHECK-NEXT:    tbl.16b v4, { v4 }, v2
+; CHECK-NEXT:    ldr q21, [x8, #16]!
 ; CHECK-NEXT:    mov x1, x8
-; CHECK-NEXT:    stp q5, q19, [x0, #96]
-; CHECK-NEXT:    umull.2d v5, v6, v16
+; CHECK-NEXT:    umull2.2d v19, v5, v16
+; CHECK-NEXT:    umull2.2d v20, v6, v17
+; CHECK-NEXT:    umull2.2d v22, v7, v18
+; CHECK-NEXT:    umull.2d v5, v5, v16
+; CHECK-NEXT:    umull2.2d v16, v4, v21
+; CHECK-NEXT:    umull.2d v4, v4, v21
+; CHECK-NEXT:    umull.2d v7, v7, v18
+; CHECK-NEXT:    umull.2d v6, v6, v17
 ; CHECK-NEXT:    str q20, [x0, #80]
-; CHECK-NEXT:    stp q4, q18, [x0]
-; CHECK-NEXT:    stp q17, q21, [x0, #32]
-; CHECK-NEXT:    str q5, [x0, #64]!
+; CHECK-NEXT:    stp q22, q4, [x0, #16]
+; CHECK-NEXT:    stp q5, q19, [x0, #96]
+; CHECK-NEXT:    str q16, [x0, #48]
+; CHECK-NEXT:    str q7, [x0]
+; CHECK-NEXT:    str q6, [x0, #64]!
 ; CHECK-NEXT:    b.ne LBB25_1
 ; CHECK-NEXT:  ; %bb.2: ; %exit
 ; CHECK-NEXT:    mov w0, wzr
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
index 27b93872b9f1d..b67080bd4798d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
@@ -40,17 +40,17 @@ define void @add_v3i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX9-NEXT:    global_load_ushort v6, v[0:1], off
-; GFX9-NEXT:    global_load_ushort v7, v[0:1], off offset:4
-; GFX9-NEXT:    global_load_ushort v8, v[2:3], off
-; GFX9-NEXT:    global_load_ushort v9, v[2:3], off offset:4
+; GFX9-NEXT:    global_load_ushort v7, v[2:3], off
+; GFX9-NEXT:    global_load_ushort v8, v[2:3], off offset:4
+; GFX9-NEXT:    global_load_ushort v9, v[0:1], off offset:4
 ; GFX9-NEXT:    global_load_ushort v10, v[0:1], off offset:2
 ; GFX9-NEXT:    global_load_ushort v11, v[2:3], off offset:2
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_and_b32_e32 v0, 0xffff, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_and_b32_e32 v1, 0xffff, v8
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_and_b32_e32 v1, 0xffff, v7
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_pk_add_u16 v2, v7, v9
+; GFX9-NEXT:    v_pk_add_u16 v2, v9, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshl_or_b32 v0, v10, 16, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -206,10 +206,10 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX9-NEXT:    global_load_ushort v6, v[0:1], off
 ; GFX9-NEXT:    global_load_ushort v7, v[0:1], off offset:4
-; GFX9-NEXT:    global_load_ushort v8, v[0:1], off offset:8
-; GFX9-NEXT:    global_load_ushort v9, v[2:3], off
-; GFX9-NEXT:    global_load_ushort v10, v[2:3], off offset:4
-; GFX9-NEXT:    global_load_ushort v11, v[2:3], off offset:8
+; GFX9-NEXT:    global_load_ushort v8, v[2:3], off
+; GFX9-NEXT:    global_load_ushort v9, v[2:3], off offset:4
+; GFX9-NEXT:    global_load_ushort v10, v[2:3], off offset:8
+; GFX9-NEXT:    global_load_ushort v11, v[0:1], off offset:8
 ; GFX9-NEXT:    global_load_ushort v12, v[0:1], off offset:2
 ; GFX9-NEXT:    global_load_ushort v13, v[0:1], off offset:6
 ; GFX9-NEXT:    global_load_ushort v14, v[2:3], off offset:2
@@ -218,12 +218,12 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX9-NEXT:    v_and_b32_e32 v0, 0xffff, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
 ; GFX9-NEXT:    v_and_b32_e32 v1, 0xffff, v7
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    v_and_b32_e32 v2, 0xffff, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_and_b32_e32 v2, 0xffff, v9
-; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_and_b32_e32 v3, 0xffff, v10
+; GFX9-NEXT:    v_and_b32_e32 v3, 0xffff, v9
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_pk_add_u16 v6, v8, v11
+; GFX9-NEXT:    v_pk_add_u16 v6, v11, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_lshl_or_b32 v0, v12, 16, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
@@ -421,11 +421,11 @@ define void @addv_7i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX9-NEXT:    global_load_ushort v6, v[0:1], off
 ; GFX9-NEXT:    global_load_ushort v7, v[0:1], off offset:4
 ; GFX9-NEXT:    global_load_ushort v8, v[0:1], off offset:8
-; GFX9-NEXT:    global_load_ushort v9, v[0:1], off offset:12
-; GFX9-NEXT:    global_load_ushort v10, v[2:3], off
-; GFX9-NEXT:    global_load_ushort v11, v[2:3], off offset:4
-; GFX9-NEXT:    global_load_ushort v12, v[2:3], off offset:8
-; GFX9-NEXT:    global_load_ushort v13, v[2:3], off offset:12
+; GFX9-NEXT:    global_load_ushort v9, v[2:3], off
+; GFX9-NEXT:    global_load_ushort v10, v[2:3], off offset:4
+; GFX9-NEXT:    global_load_ushort v11, v[2:3], off offset:8
+; GFX9-NEXT:    global_load_ushort v12, v[2:3], off offset:12
+; GFX9-NEXT:    global_load_ushort v13, v[0:1], off offset:12
 ; GFX9-NEXT:    global_load_ushort v14, v[0:1], off offset:2
 ; GFX9-NEXT:    global_load_ushort v15, v[0:1], off offset:6
 ; GFX9-NEXT:    global_load_ushort v16, v[0:1], off offset:10
@@ -438,14 +438,14 @@ define void @addv_7i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX9-NEXT:    v_and_b32_e32 v1, 0xffff, v7
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
 ; GFX9-NEXT:    v_and_b32_e32 v2, 0xffff, v8
+; GFX9-NEXT:    s_waitcnt vmcnt(10)
+; GFX9-NEXT:    v_and_b32_e32 v3, 0xffff, v9
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_and_b32_e32 v3, 0xffff, v10
+; GFX9-NEXT:    v_and_b32_e32 v6, 0xffff, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_and_b32_e32 v6, 0xffff, v11
-; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_and_b32_e32 v7, 0xffff, v12
+; GFX9-NEXT:    v_and_b32_e32 v7, 0xffff, v11
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_pk_add_u16 v8, v9, v13
+; GFX9-NEXT:    v_pk_add_u16 v8, v13, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_lshl_or_b32 v0, v14, 16, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
@@ -720,8 +720,8 @@ define void @add_v11i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX9-NEXT:    global_load_ushort v14, v[0:1], off offset:16
 ; GFX9-NEXT:    global_load_ushort v15, v[2:3], off offset:16
 ; GFX9-NEXT:    global_load_dwordx4 v[10:13], v[2:3], off
-; GFX9-NEXT:    global_load_ushort v16, v[0:1], off offset:20
-; GFX9-NEXT:    global_load_ushort v17, v[2:3], off offset:20
+; GFX9-NEXT:    global_load_ushort v16, v[2:3], off offset:20
+; GFX9-NEXT:    global_load_ushort v17, v[0:1], off offset:20
 ; GFX9-NEXT:    global_load_ushort v18, v[0:1], off offset:18
 ; GFX9-NEXT:    global_load_ushort v19, v[2:3], off offset:18
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
@@ -738,7 +738,7 @@ define void @add_v11i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_lshl_or_b32 v8, v19, 16, v15
 ; GFX9-NEXT:    global_store_dwordx4 v[4:5], v[0:3], off
-; GFX9-NEXT:    v_pk_add_u16 v6, v16, v17
+; GFX9-NEXT:    v_pk_add_u16 v6, v17, v16
 ; GFX9-NEXT:    v_pk_add_u16 v0, v7, v8
 ; GFX9-NEXT:    global_store_short v[4:5], v0, off offset:16
 ; GFX9-NEXT:    global_store_short_d16_hi v[4:5], v0, off offset:18
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll
index 493e8cef63890..634b71c56c4be 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll
@@ -1745,12 +1745,12 @@ define i65 @v_ashr_i65(i65 %value, i65 %amount) {
 ; GFX10-NEXT:    v_or_b32_e32 v2, v6, v8
 ; GFX10-NEXT:    v_or_b32_e32 v8, v7, v9
 ; GFX10-NEXT:    v_ashrrev_i64 v[6:7], v3, v[4:5]
-; GFX10-NEXT:    v_ashrrev_i32_e32 v3, 31, v5
+; GFX10-NEXT:    v_ashrrev_i32_e32 v4, 31, v5
 ; GFX10-NEXT:    v_cndmask_b32_e32 v2, v10, v2, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v11, v8, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v11, v8, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, v2, v0, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v4, v1, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v3, v6, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v3, v1, s4
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v4, v6, vcc_lo
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_ashr_i65:
@@ -1758,22 +1758,21 @@ define i65 @v_ashr_i65(i65 %value, i65 %amount) {
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    v_bfe_i32 v4, v2, 0, 1
 ; GFX11-NEXT:    v_sub_nc_u32_e32 v2, 64, v3
-; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v3
 ; GFX11-NEXT:    v_lshrrev_b64 v[6:7], v3, v[0:1]
 ; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v3
-; GFX11-NEXT:    v_ashrrev_i32_e32 v5, 31, v4
 ; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v3
+; GFX11-NEXT:    v_ashrrev_i32_e32 v5, 31, v4
 ; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v2, v[4:5]
-; GFX11-NEXT:    v_ashrrev_i64 v[10:11], v10, v[4:5]
 ; GFX11-NEXT:    v_or_b32_e32 v2, v6, v8
 ; GFX11-NEXT:    v_or_b32_e32 v8, v7, v9
+; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v3
 ; GFX11-NEXT:    v_ashrrev_i64 v[6:7], v3, v[4:5]
-; GFX11-NEXT:    v_ashrrev_i32_e32 v3, 31, v5
-; GFX11-NEXT:    v_cndmask_b32_e32 v2, v10, v2, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e32 v4, v11, v8, vcc_lo
+; GFX11-NEXT:    v_ashrrev_i64 v[10:11], v10, v[4:5]
+; GFX11-NEXT:    v_ashrrev_i32_e32 v4, 31, v5
+; GFX11-NEXT:    v_dual_cndmask_b32 v2, v10, v2 :: v_dual_cndmask_b32 v3, v11, v8
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v2, v0, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v4, v1, s0
-; GFX11-NEXT:    v_cndmask_b32_e32 v2, v3, v6, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v3, v1, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v2, v4, v6, vcc_lo
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %result = ashr i65 %value, %amount
   ret i65 %result
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll
index 9ef16aef0dd16..9b35920f8547a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll
@@ -347,30 +347,30 @@ define i64 @dyn_extract_v8i64_const_s_v(i32 %sel) {
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s4
 ; GFX10-NEXT:    v_mov_b32_e32 v2, s5
 ; GFX10-NEXT:    s_mov_b64 s[6:7], 1
-; GFX10-NEXT:    s_mov_b64 s[4:5], 3
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 2, v0
+; GFX10-NEXT:    s_mov_b64 s[8:9], 3
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, s6, v1, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v2, s7, v2, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 2, v0
-; GFX10-NEXT:    s_mov_b64 s[6:7], 4
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s5, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 3, v0
-; GFX10-NEXT:    s_mov_b64 s[4:5], 5
+; GFX10-NEXT:    s_mov_b64 s[6:7], 4
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s9, s4
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 4, v0
+; GFX10-NEXT:    s_mov_b64 s[8:9], 5
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s6, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s7, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 4, v0
-; GFX10-NEXT:    s_mov_b64 s[6:7], 6
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s5, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 5, v0
-; GFX10-NEXT:    s_mov_b64 s[4:5], 7
+; GFX10-NEXT:    s_mov_b64 s[6:7], 6
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s9, s4
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 6, v0
+; GFX10-NEXT:    s_mov_b64 s[8:9], 7
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s6, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s7, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 6, v0
-; GFX10-NEXT:    s_mov_b64 s[6:7], 8
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s5, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 7, v0
+; GFX10-NEXT:    s_mov_b64 s[6:7], 8
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s9, s4
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, v1, s6, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v2, s7, vcc_lo
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
@@ -382,30 +382,30 @@ define i64 @dyn_extract_v8i64_const_s_v(i32 %sel) {
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v0
 ; GFX11-NEXT:    v_dual_mov_b32 v1, s0 :: v_dual_mov_b32 v2, s1
 ; GFX11-NEXT:    s_mov_b64 s[2:3], 1
-; GFX11-NEXT:    s_mov_b64 s[0:1], 3
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 2, v0
+; GFX11-NEXT:    s_mov_b64 s[4:5], 3
 ; GFX11-NEXT:    v_cndmask_b32_e32 v1, s2, v1, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e32 v2, s3, v2, vcc_lo
-; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 2, v0
-; GFX11-NEXT:    s_mov_b64 s[2:3], 4
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s0, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s1, vcc_lo
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 3, v0
-; GFX11-NEXT:    s_mov_b64 s[0:1], 5
+; GFX11-NEXT:    s_mov_b64 s[2:3], 4
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s4, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s5, s0
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 4, v0
+; GFX11-NEXT:    s_mov_b64 s[4:5], 5
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s2, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s3, vcc_lo
-; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 4, v0
-; GFX11-NEXT:    s_mov_b64 s[2:3], 6
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s0, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s1, vcc_lo
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 5, v0
-; GFX11-NEXT:    s_mov_b64 s[0:1], 7
+; GFX11-NEXT:    s_mov_b64 s[2:3], 6
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s4, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s5, s0
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 6, v0
+; GFX11-NEXT:    s_mov_b64 s[4:5], 7
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s2, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s3, vcc_lo
-; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 6, v0
-; GFX11-NEXT:    s_mov_b64 s[2:3], 8
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s0, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s1, vcc_lo
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 7, v0
+; GFX11-NEXT:    s_mov_b64 s[2:3], 8
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s4, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s5, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v1, s2, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v2, s3, vcc_lo
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
index 07fcb02d98649..aeeb4ad3758c1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
@@ -5990,98 +5990,98 @@ define i128 @v_fshl_i128(i128 %lhs, i128 %rhs, i128 %amt) {
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    v_and_b32_e32 v18, 0x7f, v8
-; GFX10-NEXT:    v_not_b32_e32 v10, v8
+; GFX10-NEXT:    v_not_b32_e32 v12, v8
 ; GFX10-NEXT:    v_lshrrev_b64 v[4:5], 1, v[4:5]
-; GFX10-NEXT:    v_lshrrev_b64 v[12:13], 1, v[6:7]
-; GFX10-NEXT:    v_sub_nc_u32_e32 v11, 64, v18
-; GFX10-NEXT:    v_and_b32_e32 v19, 0x7f, v10
+; GFX10-NEXT:    v_sub_nc_u32_e32 v10, 64, v18
+; GFX10-NEXT:    v_and_b32_e32 v19, 0x7f, v12
+; GFX10-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v18
 ; GFX10-NEXT:    v_lshlrev_b64 v[8:9], v18, v[2:3]
 ; GFX10-NEXT:    v_lshl_or_b32 v5, v6, 31, v5
-; GFX10-NEXT:    v_add_nc_u32_e32 v20, 0xffffffc0, v18
-; GFX10-NEXT:    v_lshrrev_b64 v[10:11], v11, v[0:1]
+; GFX10-NEXT:    v_lshrrev_b64 v[10:11], v10, v[0:1]
+; GFX10-NEXT:    v_lshrrev_b64 v[6:7], 1, v[6:7]
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v16, 64, v19
-; GFX10-NEXT:    v_lshlrev_b64 v[6:7], v18, v[0:1]
-; GFX10-NEXT:    v_lshrrev_b64 v[14:15], v19, v[4:5]
-; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v20, v[0:1]
+; GFX10-NEXT:    v_lshlrev_b64 v[12:13], v18, v[0:1]
+; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v14, v[0:1]
 ; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v18
-; GFX10-NEXT:    v_or_b32_e32 v10, v10, v8
-; GFX10-NEXT:    v_add_nc_u32_e32 v8, 0xffffffc0, v19
-; GFX10-NEXT:    v_lshlrev_b64 v[16:17], v16, v[12:13]
+; GFX10-NEXT:    v_or_b32_e32 v8, v10, v8
+; GFX10-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v19
+; GFX10-NEXT:    v_lshrrev_b64 v[14:15], v19, v[4:5]
+; GFX10-NEXT:    v_lshlrev_b64 v[16:17], v16, v[6:7]
 ; GFX10-NEXT:    v_or_b32_e32 v11, v11, v9
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s4, 64, v19
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v0, v10, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v8, v[12:13]
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 0, v19
-; GFX10-NEXT:    v_or_b32_e32 v14, v14, v16
-; GFX10-NEXT:    v_or_b32_e32 v15, v15, v17
+; GFX10-NEXT:    v_cndmask_b32_e32 v20, v0, v8, vcc_lo
+; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v10, v[6:7]
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s5, 64, v19
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v18
+; GFX10-NEXT:    v_or_b32_e32 v0, v14, v16
+; GFX10-NEXT:    v_or_b32_e32 v10, v15, v17
 ; GFX10-NEXT:    v_cndmask_b32_e32 v11, v1, v11, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v19, v[12:13]
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v18
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v8, v14, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, v9, v15, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, 0, v6, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, 0, v7, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, v8, v4, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v9, v5, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, 0, v0, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, 0, v1, s4
-; GFX10-NEXT:    v_or_b32_e32 v0, v6, v4
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v19
+; GFX10-NEXT:    v_cndmask_b32_e32 v12, 0, v12, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, v8, v0, s5
+; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v19, v[6:7]
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v9, v10, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v7, 0, v13, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v20, v2, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v8, v4, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v6, v5, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, 0, v0, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, 0, v1, s5
+; GFX10-NEXT:    v_or_b32_e32 v0, v12, v4
 ; GFX10-NEXT:    v_or_b32_e32 v1, v7, v5
-; GFX10-NEXT:    v_or_b32_e32 v2, v2, v8
-; GFX10-NEXT:    v_or_b32_e32 v3, v3, v9
+; GFX10-NEXT:    v_or_b32_e32 v2, v2, v6
+; GFX10-NEXT:    v_or_b32_e32 v3, v3, v8
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_fshl_i128:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    v_and_b32_e32 v18, 0x7f, v8
-; GFX11-NEXT:    v_not_b32_e32 v10, v8
+; GFX11-NEXT:    v_not_b32_e32 v12, v8
 ; GFX11-NEXT:    v_lshrrev_b64 v[4:5], 1, v[4:5]
-; GFX11-NEXT:    v_lshrrev_b64 v[12:13], 1, v[6:7]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_sub_nc_u32_e32 v11, 64, v18
-; GFX11-NEXT:    v_and_b32_e32 v19, 0x7f, v10
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_sub_nc_u32_e32 v10, 64, v18
+; GFX11-NEXT:    v_and_b32_e32 v19, 0x7f, v12
 ; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v18, v[2:3]
+; GFX11-NEXT:    v_lshlrev_b64 v[12:13], v18, v[0:1]
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v18
+; GFX11-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v18
+; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v10, v[0:1]
 ; GFX11-NEXT:    v_lshl_or_b32 v5, v6, 31, v5
-; GFX11-NEXT:    v_lshlrev_b64 v[6:7], v18, v[0:1]
-; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v11, v[0:1]
+; GFX11-NEXT:    v_lshrrev_b64 v[6:7], 1, v[6:7]
 ; GFX11-NEXT:    v_sub_nc_u32_e32 v16, 64, v19
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v18
-; GFX11-NEXT:    v_add_nc_u32_e32 v20, 0xffffffc0, v18
+; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v14, v[0:1]
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v19
+; GFX11-NEXT:    v_or_b32_e32 v8, v10, v8
+; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v19
 ; GFX11-NEXT:    v_lshrrev_b64 v[14:15], v19, v[4:5]
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v19
-; GFX11-NEXT:    v_or_b32_e32 v10, v10, v8
-; GFX11-NEXT:    v_cndmask_b32_e32 v7, 0, v7, vcc_lo
-; GFX11-NEXT:    v_add_nc_u32_e32 v8, 0xffffffc0, v19
-; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v16, v[12:13]
-; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v20, v[0:1]
+; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v16, v[6:7]
 ; GFX11-NEXT:    v_or_b32_e32 v11, v11, v9
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v19
-; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v8, v[12:13]
-; GFX11-NEXT:    v_cndmask_b32_e32 v6, 0, v6, vcc_lo
-; GFX11-NEXT:    v_or_b32_e32 v14, v14, v16
-; GFX11-NEXT:    v_or_b32_e32 v15, v15, v17
-; GFX11-NEXT:    v_dual_cndmask_b32 v10, v0, v10 :: v_dual_cndmask_b32 v11, v1, v11
-; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v19, v[12:13]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, v8, v14, s0
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v18
-; GFX11-NEXT:    v_cndmask_b32_e64 v9, v9, v15, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v4, v8, v4, s1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v5, v9, v5, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, 0, v0, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v9, 0, v1, s0
-; GFX11-NEXT:    v_or_b32_e32 v0, v6, v4
+; GFX11-NEXT:    v_cndmask_b32_e32 v20, v0, v8, vcc_lo
+; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v10, v[6:7]
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v19
+; GFX11-NEXT:    v_cndmask_b32_e32 v12, 0, v12, vcc_lo
+; GFX11-NEXT:    v_or_b32_e32 v0, v14, v16
+; GFX11-NEXT:    v_or_b32_e32 v10, v15, v17
+; GFX11-NEXT:    v_cndmask_b32_e32 v11, v1, v11, vcc_lo
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v18
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v8, v8, v0, s1
+; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v19, v[6:7]
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v9, v10, s1
+; GFX11-NEXT:    v_cndmask_b32_e32 v7, 0, v13, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v20, v2, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v4, v8, v4, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v5, v6, v5, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, 0, v0, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v8, 0, v1, s1
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v0, v12, v4
 ; GFX11-NEXT:    v_or_b32_e32 v1, v7, v5
-; GFX11-NEXT:    v_or_b32_e32 v2, v2, v8
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-NEXT:    v_or_b32_e32 v3, v3, v9
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v2, v2, v6
+; GFX11-NEXT:    v_or_b32_e32 v3, v3, v8
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %result = call i128 @llvm.fshl.i128(i128 %lhs, i128 %rhs, i128 %amt)
   ret i128 %result
@@ -6249,45 +6249,45 @@ define amdgpu_ps <4 x float> @v_fshl_i128_ssv(i128 inreg %lhs, i128 inreg %rhs,
 ; GFX10-LABEL: v_fshl_i128_ssv:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    v_and_b32_e32 v12, 0x7f, v0
-; GFX10-NEXT:    v_not_b32_e32 v2, v0
+; GFX10-NEXT:    v_not_b32_e32 v6, v0
 ; GFX10-NEXT:    s_mov_b32 s8, 0
 ; GFX10-NEXT:    s_lshr_b64 s[4:5], s[4:5], 1
 ; GFX10-NEXT:    s_lshl_b32 s9, s6, 31
-; GFX10-NEXT:    v_sub_nc_u32_e32 v3, 64, v12
-; GFX10-NEXT:    v_and_b32_e32 v13, 0x7f, v2
+; GFX10-NEXT:    v_sub_nc_u32_e32 v2, 64, v12
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v12, s[2:3]
+; GFX10-NEXT:    v_and_b32_e32 v13, 0x7f, v6
+; GFX10-NEXT:    v_add_nc_u32_e32 v7, 0xffffffc0, v12
 ; GFX10-NEXT:    s_or_b64 s[8:9], s[4:5], s[8:9]
+; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v2, s[0:1]
 ; GFX10-NEXT:    s_lshr_b64 s[6:7], s[6:7], 1
-; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v3, s[0:1]
-; GFX10-NEXT:    v_sub_nc_u32_e32 v8, 64, v13
-; GFX10-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v12
-; GFX10-NEXT:    v_lshrrev_b64 v[6:7], v13, s[8:9]
 ; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v12
+; GFX10-NEXT:    v_lshlrev_b64 v[6:7], v7, s[0:1]
+; GFX10-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v13
+; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v13, s[8:9]
+; GFX10-NEXT:    v_or_b32_e32 v0, v2, v0
+; GFX10-NEXT:    v_sub_nc_u32_e32 v2, 64, v13
 ; GFX10-NEXT:    v_lshlrev_b64 v[4:5], v12, s[0:1]
-; GFX10-NEXT:    v_or_b32_e32 v2, v2, v0
-; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0xffffffc0, v13
-; GFX10-NEXT:    v_lshlrev_b64 v[8:9], v8, s[6:7]
-; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v10, s[0:1]
-; GFX10-NEXT:    v_or_b32_e32 v3, v3, v1
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s0, 64, v13
-; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v0, s[6:7]
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s1, 0, v13
-; GFX10-NEXT:    v_or_b32_e32 v6, v6, v8
-; GFX10-NEXT:    v_or_b32_e32 v7, v7, v9
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v10, v2, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v11, v3, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v13, s[6:7]
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v6, s0
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v12
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v7, s0
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s1, 64, v13
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s0, 0, v12
+; GFX10-NEXT:    v_cndmask_b32_e32 v6, v6, v0, vcc_lo
+; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v2, s[6:7]
+; GFX10-NEXT:    v_or_b32_e32 v2, v3, v1
+; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v14, s[6:7]
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v13
 ; GFX10-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v5, 0, v5, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s8, s1
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v8, s2, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v7, v10, s3, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s9, s1
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s0
+; GFX10-NEXT:    v_or_b32_e32 v3, v8, v10
+; GFX10-NEXT:    v_or_b32_e32 v8, v9, v11
+; GFX10-NEXT:    v_cndmask_b32_e32 v7, v7, v2, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, s2, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v3, s1
+; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v13, s[6:7]
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v8, s1
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, s3, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s8, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s9, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s1
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s1
 ; GFX10-NEXT:    v_or_b32_e32 v0, v4, v0
 ; GFX10-NEXT:    v_or_b32_e32 v1, v5, v1
 ; GFX10-NEXT:    v_or_b32_e32 v2, v6, v2
@@ -6297,51 +6297,56 @@ define amdgpu_ps <4 x float> @v_fshl_i128_ssv(i128 inreg %lhs, i128 inreg %rhs,
 ; GFX11-LABEL: v_fshl_i128_ssv:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    v_and_b32_e32 v12, 0x7f, v0
-; GFX11-NEXT:    v_not_b32_e32 v2, v0
-; GFX11-NEXT:    s_mov_b32 s8, 0
-; GFX11-NEXT:    s_lshr_b64 s[4:5], s[4:5], 1
+; GFX11-NEXT:    v_not_b32_e32 v6, v0
 ; GFX11-NEXT:    s_lshl_b32 s9, s6, 31
-; GFX11-NEXT:    v_lshlrev_b64 v[4:5], v12, s[0:1]
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v12
-; GFX11-NEXT:    v_and_b32_e32 v13, 0x7f, v2
-; GFX11-NEXT:    s_or_b64 s[8:9], s[4:5], s[8:9]
 ; GFX11-NEXT:    s_lshr_b64 s[6:7], s[6:7], 1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
-; GFX11-NEXT:    v_dual_cndmask_b32 v4, 0, v4 :: v_dual_cndmask_b32 v5, 0, v5
-; GFX11-NEXT:    v_sub_nc_u32_e32 v3, 64, v12
+; GFX11-NEXT:    s_mov_b32 s8, 0
+; GFX11-NEXT:    v_sub_nc_u32_e32 v2, 64, v12
 ; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v12, s[2:3]
-; GFX11-NEXT:    v_sub_nc_u32_e32 v8, 64, v13
-; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v12
-; GFX11-NEXT:    v_lshrrev_b64 v[6:7], v13, s[8:9]
-; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v3, s[0:1]
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s4, 0, v12
-; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v8, s[6:7]
-; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v10, s[0:1]
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v13
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v13
-; GFX11-NEXT:    v_or_b32_e32 v2, v2, v0
-; GFX11-NEXT:    v_add_nc_u32_e32 v0, 0xffffffc0, v13
-; GFX11-NEXT:    v_or_b32_e32 v3, v3, v1
-; GFX11-NEXT:    v_or_b32_e32 v6, v6, v8
-; GFX11-NEXT:    v_or_b32_e32 v7, v7, v9
-; GFX11-NEXT:    v_cndmask_b32_e32 v8, v10, v2, vcc_lo
-; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v0, s[6:7]
-; GFX11-NEXT:    v_cndmask_b32_e32 v10, v11, v3, vcc_lo
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v12
+; GFX11-NEXT:    v_and_b32_e32 v13, 0x7f, v6
+; GFX11-NEXT:    v_add_nc_u32_e32 v7, 0xffffffc0, v12
+; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v2, s[0:1]
+; GFX11-NEXT:    s_lshr_b64 s[4:5], s[4:5], 1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    s_or_b64 s[8:9], s[4:5], s[8:9]
+; GFX11-NEXT:    v_lshlrev_b64 v[6:7], v7, s[0:1]
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_or_b32_e32 v0, v2, v0
+; GFX11-NEXT:    v_sub_nc_u32_e32 v2, 64, v13
+; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v2, s[6:7]
+; GFX11-NEXT:    v_or_b32_e32 v2, v3, v1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_cndmask_b32_e32 v7, v7, v2, vcc_lo
+; GFX11-NEXT:    v_lshlrev_b64 v[4:5], v12, s[0:1]
+; GFX11-NEXT:    v_cndmask_b32_e32 v6, v6, v0, vcc_lo
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v12
+; GFX11-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc_lo
+; GFX11-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v13
+; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v13, s[8:9]
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v13
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s4, 0, v13
+; GFX11-NEXT:    v_cndmask_b32_e32 v5, 0, v5, vcc_lo
+; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v14, s[6:7]
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, s2, s0
+; GFX11-NEXT:    v_or_b32_e32 v3, v8, v10
+; GFX11-NEXT:    v_or_b32_e32 v8, v9, v11
+; GFX11-NEXT:    v_cndmask_b32_e64 v7, v7, s3, s0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v3, s1
 ; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v13, s[6:7]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v6, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, v7, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, v8, s2, s4
-; GFX11-NEXT:    v_cndmask_b32_e64 v7, v10, s3, s4
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s8, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s9, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_or_b32_e32 v2, v6, v2
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, v8, s1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s8, s4
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s9, s4
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s1
 ; GFX11-NEXT:    v_or_b32_e32 v0, v4, v0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-NEXT:    v_or_b32_e32 v1, v5, v1
+; GFX11-NEXT:    v_or_b32_e32 v2, v6, v2
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-NEXT:    v_or_b32_e32 v3, v7, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.fshl.i128(i128 %lhs, i128 %rhs, i128 %amt)
@@ -6783,49 +6788,49 @@ define amdgpu_ps <4 x float> @v_fshl_i128_vss(i128 %lhs, i128 inreg %rhs, i128 i
 ; GFX10-LABEL: v_fshl_i128_vss:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_and_b32 s5, s4, 0x7f
-; GFX10-NEXT:    s_sub_i32 s6, s5, 64
 ; GFX10-NEXT:    s_sub_i32 s7, 64, s5
+; GFX10-NEXT:    s_sub_i32 s6, s5, 64
 ; GFX10-NEXT:    s_cmp_lt_u32 s5, 64
 ; GFX10-NEXT:    v_lshrrev_b64 v[4:5], s7, v[0:1]
+; GFX10-NEXT:    v_lshlrev_b64 v[6:7], s5, v[2:3]
 ; GFX10-NEXT:    s_cselect_b32 s8, 1, 0
 ; GFX10-NEXT:    s_cmp_eq_u32 s5, 0
-; GFX10-NEXT:    v_lshlrev_b64 v[6:7], s5, v[2:3]
-; GFX10-NEXT:    s_cselect_b32 s9, 1, 0
 ; GFX10-NEXT:    v_lshlrev_b64 v[8:9], s5, v[0:1]
-; GFX10-NEXT:    v_lshlrev_b64 v[0:1], s6, v[0:1]
-; GFX10-NEXT:    s_mov_b32 s6, 0
-; GFX10-NEXT:    s_lshr_b64 s[0:1], s[0:1], 1
-; GFX10-NEXT:    s_lshl_b32 s7, s2, 31
+; GFX10-NEXT:    s_cselect_b32 s9, 1, 0
 ; GFX10-NEXT:    s_and_b32 s5, 1, s8
-; GFX10-NEXT:    s_or_b64 s[0:1], s[0:1], s[6:7]
-; GFX10-NEXT:    s_andn2_b32 s6, 0x7f, s4
+; GFX10-NEXT:    v_lshlrev_b64 v[0:1], s6, v[0:1]
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
 ; GFX10-NEXT:    v_or_b32_e32 v4, v4, v6
 ; GFX10-NEXT:    v_or_b32_e32 v5, v5, v7
 ; GFX10-NEXT:    s_and_b32 s5, 1, s9
-; GFX10-NEXT:    s_lshr_b64 s[2:3], s[2:3], 1
-; GFX10-NEXT:    s_not_b32 s8, s4
-; GFX10-NEXT:    s_sub_i32 s10, s6, 64
-; GFX10-NEXT:    s_sub_i32 s7, 64, s6
-; GFX10-NEXT:    s_cmp_lt_u32 s6, 64
+; GFX10-NEXT:    s_mov_b32 s6, 0
 ; GFX10-NEXT:    v_cndmask_b32_e32 v6, 0, v8, vcc_lo
-; GFX10-NEXT:    s_cselect_b32 s11, 1, 0
-; GFX10-NEXT:    s_cmp_eq_u32 s6, 0
 ; GFX10-NEXT:    v_cndmask_b32_e32 v7, 0, v9, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc_lo
+; GFX10-NEXT:    s_lshr_b64 s[0:1], s[0:1], 1
+; GFX10-NEXT:    s_lshl_b32 s7, s2, 31
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
+; GFX10-NEXT:    s_andn2_b32 s5, 0x7f, s4
+; GFX10-NEXT:    s_or_b64 s[0:1], s[0:1], s[6:7]
+; GFX10-NEXT:    s_lshr_b64 s[2:3], s[2:3], 1
+; GFX10-NEXT:    s_not_b32 s8, s4
+; GFX10-NEXT:    s_sub_i32 s10, s5, 64
+; GFX10-NEXT:    s_sub_i32 s6, 64, s5
+; GFX10-NEXT:    s_cmp_lt_u32 s5, 64
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v0, v2, vcc_lo
+; GFX10-NEXT:    s_cselect_b32 s11, 1, 0
+; GFX10-NEXT:    s_cmp_eq_u32 s5, 0
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v1, v3, vcc_lo
 ; GFX10-NEXT:    s_cselect_b32 s12, 1, 0
 ; GFX10-NEXT:    s_lshr_b64 s[4:5], s[0:1], s8
-; GFX10-NEXT:    s_lshl_b64 s[6:7], s[2:3], s7
+; GFX10-NEXT:    s_lshl_b64 s[6:7], s[2:3], s6
 ; GFX10-NEXT:    s_lshr_b64 s[8:9], s[2:3], s8
 ; GFX10-NEXT:    s_or_b64 s[4:5], s[4:5], s[6:7]
 ; GFX10-NEXT:    s_lshr_b64 s[2:3], s[2:3], s10
 ; GFX10-NEXT:    s_cmp_lg_u32 s11, 0
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v0, v2, vcc_lo
 ; GFX10-NEXT:    s_cselect_b64 s[2:3], s[4:5], s[2:3]
 ; GFX10-NEXT:    s_cmp_lg_u32 s12, 0
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v1, v3, vcc_lo
 ; GFX10-NEXT:    s_cselect_b64 s[0:1], s[0:1], s[2:3]
 ; GFX10-NEXT:    s_cmp_lg_u32 s11, 0
 ; GFX10-NEXT:    v_or_b32_e32 v0, s0, v6
@@ -6839,44 +6844,45 @@ define amdgpu_ps <4 x float> @v_fshl_i128_vss(i128 %lhs, i128 inreg %rhs, i128 i
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_and_b32 s5, s4, 0x7f
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_sub_i32 s6, s5, 64
 ; GFX11-NEXT:    s_sub_i32 s7, 64, s5
+; GFX11-NEXT:    s_sub_i32 s6, s5, 64
 ; GFX11-NEXT:    s_cmp_lt_u32 s5, 64
 ; GFX11-NEXT:    v_lshrrev_b64 v[4:5], s7, v[0:1]
+; GFX11-NEXT:    v_lshlrev_b64 v[6:7], s5, v[2:3]
 ; GFX11-NEXT:    s_cselect_b32 s8, 1, 0
 ; GFX11-NEXT:    s_cmp_eq_u32 s5, 0
-; GFX11-NEXT:    v_lshlrev_b64 v[6:7], s5, v[2:3]
-; GFX11-NEXT:    s_cselect_b32 s9, 1, 0
 ; GFX11-NEXT:    v_lshlrev_b64 v[8:9], s5, v[0:1]
-; GFX11-NEXT:    v_lshlrev_b64 v[0:1], s6, v[0:1]
-; GFX11-NEXT:    s_mov_b32 s6, 0
-; GFX11-NEXT:    s_lshr_b64 s[0:1], s[0:1], 1
-; GFX11-NEXT:    s_lshl_b32 s7, s2, 31
+; GFX11-NEXT:    s_cselect_b32 s9, 1, 0
 ; GFX11-NEXT:    s_and_b32 s5, 1, s8
-; GFX11-NEXT:    s_or_b64 s[0:1], s[0:1], s[6:7]
-; GFX11-NEXT:    s_and_not1_b32 s6, 0x7f, s4
+; GFX11-NEXT:    v_lshlrev_b64 v[0:1], s6, v[0:1]
 ; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
 ; GFX11-NEXT:    v_or_b32_e32 v4, v4, v6
 ; GFX11-NEXT:    v_or_b32_e32 v5, v5, v7
 ; GFX11-NEXT:    s_and_b32 s5, 1, s9
-; GFX11-NEXT:    s_lshr_b64 s[2:3], s[2:3], 1
-; GFX11-NEXT:    s_not_b32 s8, s4
-; GFX11-NEXT:    s_sub_i32 s10, s6, 64
-; GFX11-NEXT:    s_sub_i32 s7, 64, s6
-; GFX11-NEXT:    s_cmp_lt_u32 s6, 64
+; GFX11-NEXT:    s_mov_b32 s6, 0
 ; GFX11-NEXT:    v_dual_cndmask_b32 v6, 0, v8 :: v_dual_cndmask_b32 v7, 0, v9
-; GFX11-NEXT:    s_cselect_b32 s11, 1, 0
-; GFX11-NEXT:    s_cmp_eq_u32 s6, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-NEXT:    v_dual_cndmask_b32 v0, v0, v4 :: v_dual_cndmask_b32 v1, v1, v5
+; GFX11-NEXT:    s_lshr_b64 s[0:1], s[0:1], 1
+; GFX11-NEXT:    s_lshl_b32 s7, s2, 31
 ; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
+; GFX11-NEXT:    s_and_not1_b32 s5, 0x7f, s4
+; GFX11-NEXT:    s_or_b64 s[0:1], s[0:1], s[6:7]
+; GFX11-NEXT:    s_lshr_b64 s[2:3], s[2:3], 1
+; GFX11-NEXT:    s_not_b32 s8, s4
+; GFX11-NEXT:    s_sub_i32 s10, s5, 64
+; GFX11-NEXT:    s_sub_i32 s6, 64, s5
+; GFX11-NEXT:    s_cmp_lt_u32 s5, 64
+; GFX11-NEXT:    v_dual_cndmask_b32 v2, v0, v2 :: v_dual_cndmask_b32 v3, v1, v3
+; GFX11-NEXT:    s_cselect_b32 s11, 1, 0
+; GFX11-NEXT:    s_cmp_eq_u32 s5, 0
 ; GFX11-NEXT:    s_cselect_b32 s12, 1, 0
 ; GFX11-NEXT:    s_lshr_b64 s[4:5], s[0:1], s8
-; GFX11-NEXT:    s_lshl_b64 s[6:7], s[2:3], s7
+; GFX11-NEXT:    s_lshl_b64 s[6:7], s[2:3], s6
 ; GFX11-NEXT:    s_lshr_b64 s[8:9], s[2:3], s8
 ; GFX11-NEXT:    s_or_b64 s[4:5], s[4:5], s[6:7]
 ; GFX11-NEXT:    s_lshr_b64 s[2:3], s[2:3], s10
 ; GFX11-NEXT:    s_cmp_lg_u32 s11, 0
-; GFX11-NEXT:    v_dual_cndmask_b32 v2, v0, v2 :: v_dual_cndmask_b32 v3, v1, v3
 ; GFX11-NEXT:    s_cselect_b64 s[2:3], s[4:5], s[2:3]
 ; GFX11-NEXT:    s_cmp_lg_u32 s12, 0
 ; GFX11-NEXT:    s_cselect_b64 s[0:1], s[0:1], s[2:3]
@@ -7741,85 +7747,85 @@ define <2 x i128> @v_fshl_v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %a
 ; GFX10-NEXT:    v_and_b32_e32 v27, 0x7f, v16
 ; GFX10-NEXT:    v_not_b32_e32 v21, v16
 ; GFX10-NEXT:    v_lshrrev_b64 v[8:9], 1, v[8:9]
-; GFX10-NEXT:    v_sub_nc_u32_e32 v17, 64, v27
+; GFX10-NEXT:    v_sub_nc_u32_e32 v18, 64, v27
 ; GFX10-NEXT:    v_and_b32_e32 v28, 0x7f, v21
-; GFX10-NEXT:    v_lshlrev_b64 v[18:19], v27, v[2:3]
+; GFX10-NEXT:    v_add_nc_u32_e32 v23, 0xffffffc0, v27
+; GFX10-NEXT:    v_lshlrev_b64 v[21:22], v27, v[2:3]
 ; GFX10-NEXT:    v_lshl_or_b32 v9, v10, 31, v9
+; GFX10-NEXT:    v_lshrrev_b64 v[18:19], v18, v[0:1]
 ; GFX10-NEXT:    v_lshrrev_b64 v[10:11], 1, v[10:11]
-; GFX10-NEXT:    v_lshrrev_b64 v[16:17], v17, v[0:1]
-; GFX10-NEXT:    v_add_nc_u32_e32 v29, 0xffffffc0, v27
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v25, 64, v28
-; GFX10-NEXT:    v_lshlrev_b64 v[21:22], v27, v[0:1]
-; GFX10-NEXT:    v_lshrrev_b64 v[23:24], v28, v[8:9]
+; GFX10-NEXT:    v_lshlrev_b64 v[16:17], v27, v[0:1]
+; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v23, v[0:1]
 ; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v27
-; GFX10-NEXT:    v_or_b32_e32 v18, v16, v18
-; GFX10-NEXT:    v_add_nc_u32_e32 v16, 0xffffffc0, v28
+; GFX10-NEXT:    v_or_b32_e32 v18, v18, v21
+; GFX10-NEXT:    v_add_nc_u32_e32 v21, 0xffffffc0, v28
+; GFX10-NEXT:    v_lshrrev_b64 v[23:24], v28, v[8:9]
 ; GFX10-NEXT:    v_lshlrev_b64 v[25:26], v25, v[10:11]
-; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v29, v[0:1]
-; GFX10-NEXT:    v_or_b32_e32 v19, v17, v19
-; GFX10-NEXT:    v_cndmask_b32_e32 v21, 0, v21, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b64 v[16:17], v16, v[10:11]
-; GFX10-NEXT:    v_cndmask_b32_e32 v22, 0, v22, vcc_lo
-; GFX10-NEXT:    v_or_b32_e32 v23, v23, v25
-; GFX10-NEXT:    v_cndmask_b32_e32 v18, v0, v18, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v19, v1, v19, vcc_lo
-; GFX10-NEXT:    v_or_b32_e32 v24, v24, v26
-; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v28
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s5, 64, v28
+; GFX10-NEXT:    v_cndmask_b32_e32 v29, v0, v18, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v0, v19, v22
+; GFX10-NEXT:    v_lshrrev_b64 v[18:19], v21, v[10:11]
+; GFX10-NEXT:    v_cndmask_b32_e32 v16, 0, v16, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v21, v23, v25
 ; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v27
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 0, v28
+; GFX10-NEXT:    v_cndmask_b32_e32 v17, 0, v17, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v23, v1, v0, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v28
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v18, v21, s5
+; GFX10-NEXT:    v_or_b32_e32 v22, v24, v26
 ; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v28, v[10:11]
-; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v23, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v17, v24, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v23, v19, v3, s4
-; GFX10-NEXT:    v_and_b32_e32 v24, 0x7f, v20
-; GFX10-NEXT:    v_cndmask_b32_e32 v25, 0, v1, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v16, v8, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v10, v9, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v29, v2, s4
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v18, v8, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v10, v19, v22, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v22, v23, v3, s4
+; GFX10-NEXT:    v_and_b32_e32 v23, 0x7f, v20
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, 0, v0, s5
+; GFX10-NEXT:    v_or_b32_e32 v0, v16, v2
 ; GFX10-NEXT:    v_not_b32_e32 v16, v20
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, 0, v0, vcc_lo
-; GFX10-NEXT:    v_or_b32_e32 v0, v21, v3
-; GFX10-NEXT:    v_or_b32_e32 v1, v22, v8
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v10, v9, vcc_lo
 ; GFX10-NEXT:    v_lshrrev_b64 v[8:9], 1, v[12:13]
-; GFX10-NEXT:    v_sub_nc_u32_e32 v3, 64, v24
-; GFX10-NEXT:    v_and_b32_e32 v22, 0x7f, v16
-; GFX10-NEXT:    v_or_b32_e32 v2, v2, v10
-; GFX10-NEXT:    v_lshlrev_b64 v[12:13], v24, v[6:7]
-; GFX10-NEXT:    v_lshlrev_b64 v[16:17], v24, v[4:5]
-; GFX10-NEXT:    v_lshrrev_b64 v[10:11], v3, v[4:5]
+; GFX10-NEXT:    v_sub_nc_u32_e32 v2, 64, v23
+; GFX10-NEXT:    v_cndmask_b32_e64 v25, 0, v1, s5
+; GFX10-NEXT:    v_and_b32_e32 v20, 0x7f, v16
+; GFX10-NEXT:    v_or_b32_e32 v1, v17, v3
+; GFX10-NEXT:    v_add_nc_u32_e32 v17, 0xffffffc0, v23
+; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v2, v[4:5]
+; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v23, v[6:7]
 ; GFX10-NEXT:    v_lshl_or_b32 v9, v14, 31, v9
 ; GFX10-NEXT:    v_lshrrev_b64 v[14:15], 1, v[14:15]
-; GFX10-NEXT:    v_sub_nc_u32_e32 v20, 64, v22
-; GFX10-NEXT:    v_add_nc_u32_e32 v3, 0xffffffc0, v24
-; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v24
-; GFX10-NEXT:    v_or_b32_e32 v12, v10, v12
-; GFX10-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v22
-; GFX10-NEXT:    v_lshrrev_b64 v[18:19], v22, v[8:9]
-; GFX10-NEXT:    v_lshlrev_b64 v[20:21], v20, v[14:15]
-; GFX10-NEXT:    v_lshlrev_b64 v[3:4], v3, v[4:5]
-; GFX10-NEXT:    v_or_b32_e32 v5, v11, v13
-; GFX10-NEXT:    v_lshrrev_b64 v[10:11], v10, v[14:15]
-; GFX10-NEXT:    v_cndmask_b32_e32 v13, 0, v16, vcc_lo
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s4, 64, v22
-; GFX10-NEXT:    v_or_b32_e32 v16, v18, v20
-; GFX10-NEXT:    v_or_b32_e32 v18, v19, v21
-; GFX10-NEXT:    v_cndmask_b32_e32 v12, v3, v12, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v4, v5, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b64 v[3:4], v22, v[14:15]
-; GFX10-NEXT:    v_cndmask_b32_e64 v10, v10, v16, s4
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 0, v22
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v24
-; GFX10-NEXT:    v_cndmask_b32_e64 v11, v11, v18, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, 0, v17, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v12, v6, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v10, v8, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v11, v9, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, 0, v3, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v10, 0, v4, s4
-; GFX10-NEXT:    v_or_b32_e32 v3, v23, v25
-; GFX10-NEXT:    v_or_b32_e32 v4, v13, v5
-; GFX10-NEXT:    v_or_b32_e32 v5, v14, v8
+; GFX10-NEXT:    v_sub_nc_u32_e32 v18, 64, v20
+; GFX10-NEXT:    v_lshlrev_b64 v[12:13], v23, v[4:5]
+; GFX10-NEXT:    v_lshlrev_b64 v[4:5], v17, v[4:5]
+; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v23
+; GFX10-NEXT:    v_or_b32_e32 v10, v2, v10
+; GFX10-NEXT:    v_add_nc_u32_e32 v26, 0xffffffc0, v20
+; GFX10-NEXT:    v_lshrrev_b64 v[16:17], v20, v[8:9]
+; GFX10-NEXT:    v_lshlrev_b64 v[18:19], v18, v[14:15]
+; GFX10-NEXT:    v_or_b32_e32 v2, v21, v24
+; GFX10-NEXT:    v_or_b32_e32 v11, v3, v11
+; GFX10-NEXT:    v_cndmask_b32_e32 v21, v4, v10, vcc_lo
+; GFX10-NEXT:    v_lshrrev_b64 v[3:4], v26, v[14:15]
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s5, 64, v20
+; GFX10-NEXT:    v_or_b32_e32 v10, v16, v18
+; GFX10-NEXT:    v_or_b32_e32 v16, v17, v19
+; GFX10-NEXT:    v_cndmask_b32_e32 v5, v5, v11, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v23
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v20
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v10, s5
+; GFX10-NEXT:    v_lshrrev_b64 v[10:11], v20, v[14:15]
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v16, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v12, 0, v12, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v13, 0, v13, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v21, v6, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v3, v8, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, v4, v9, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v9, 0, v10, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v10, 0, v11, s5
+; GFX10-NEXT:    v_or_b32_e32 v3, v22, v25
+; GFX10-NEXT:    v_or_b32_e32 v4, v12, v5
+; GFX10-NEXT:    v_or_b32_e32 v5, v13, v8
 ; GFX10-NEXT:    v_or_b32_e32 v6, v6, v9
 ; GFX10-NEXT:    v_or_b32_e32 v7, v7, v10
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
@@ -7830,92 +7836,93 @@ define <2 x i128> @v_fshl_v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %a
 ; GFX11-NEXT:    v_and_b32_e32 v27, 0x7f, v16
 ; GFX11-NEXT:    v_not_b32_e32 v21, v16
 ; GFX11-NEXT:    v_lshrrev_b64 v[8:9], 1, v[8:9]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v27, v[0:1]
+; GFX11-NEXT:    v_sub_nc_u32_e32 v18, 64, v27
 ; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v27
-; GFX11-NEXT:    v_and_b32_e32 v28, 0x7f, v21
-; GFX11-NEXT:    v_lshlrev_b64 v[21:22], v27, v[0:1]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_add_nc_u32_e32 v23, 0xffffffc0, v27
 ; GFX11-NEXT:    v_lshl_or_b32 v9, v10, 31, v9
 ; GFX11-NEXT:    v_lshrrev_b64 v[10:11], 1, v[10:11]
-; GFX11-NEXT:    v_cndmask_b32_e32 v22, 0, v22, vcc_lo
-; GFX11-NEXT:    v_sub_nc_u32_e32 v17, 64, v27
-; GFX11-NEXT:    v_lshlrev_b64 v[18:19], v27, v[2:3]
+; GFX11-NEXT:    v_lshrrev_b64 v[18:19], v18, v[0:1]
+; GFX11-NEXT:    v_cndmask_b32_e32 v16, 0, v16, vcc_lo
+; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v23, v[0:1]
+; GFX11-NEXT:    v_and_b32_e32 v28, 0x7f, v21
+; GFX11-NEXT:    v_lshlrev_b64 v[21:22], v27, v[2:3]
 ; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v27
-; GFX11-NEXT:    v_cndmask_b32_e32 v21, 0, v21, vcc_lo
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_lshrrev_b64 v[16:17], v17, v[0:1]
-; GFX11-NEXT:    v_or_b32_e32 v18, v16, v18
-; GFX11-NEXT:    v_add_nc_u32_e32 v29, 0xffffffc0, v27
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_or_b32_e32 v19, v17, v19
-; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v29, v[0:1]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_dual_cndmask_b32 v18, v0, v18 :: v_dual_cndmask_b32 v19, v1, v19
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_or_b32_e32 v18, v18, v21
+; GFX11-NEXT:    v_cndmask_b32_e32 v29, v0, v18, vcc_lo
 ; GFX11-NEXT:    v_sub_nc_u32_e32 v25, 64, v28
-; GFX11-NEXT:    v_add_nc_u32_e32 v16, 0xffffffc0, v28
+; GFX11-NEXT:    v_add_nc_u32_e32 v21, 0xffffffc0, v28
 ; GFX11-NEXT:    v_lshrrev_b64 v[23:24], v28, v[8:9]
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v28
-; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v28, v[10:11]
+; GFX11-NEXT:    v_or_b32_e32 v0, v19, v22
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v28
 ; GFX11-NEXT:    v_lshlrev_b64 v[25:26], v25, v[10:11]
-; GFX11-NEXT:    v_lshrrev_b64 v[16:17], v16, v[10:11]
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v28
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_or_b32_e32 v23, v23, v25
-; GFX11-NEXT:    v_or_b32_e32 v24, v24, v26
-; GFX11-NEXT:    v_dual_cndmask_b32 v25, 0, v1 :: v_dual_cndmask_b32 v16, v16, v23
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e32 v10, v17, v24, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v23, v19, v3, s0
-; GFX11-NEXT:    v_and_b32_e32 v24, 0x7f, v20
-; GFX11-NEXT:    v_cndmask_b32_e64 v3, v16, v8, s1
+; GFX11-NEXT:    v_lshrrev_b64 v[18:19], v21, v[10:11]
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v21, v23, v25
+; GFX11-NEXT:    v_cndmask_b32_e32 v23, v1, v0, vcc_lo
+; GFX11-NEXT:    v_or_b32_e32 v22, v24, v26
+; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v28, v[10:11]
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, v10, v9, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v18, v18, v21, s1
+; GFX11-NEXT:    v_cndmask_b32_e32 v17, 0, v17, vcc_lo
+; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v28
+; GFX11-NEXT:    v_cndmask_b32_e64 v21, v29, v2, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v10, v19, v22, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v22, v23, v3, s0
+; GFX11-NEXT:    v_and_b32_e32 v23, 0x7f, v20
+; GFX11-NEXT:    v_cndmask_b32_e32 v2, v18, v8, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v24, 0, v0, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v25, 0, v1, s1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v23
+; GFX11-NEXT:    v_or_b32_e32 v0, v16, v2
 ; GFX11-NEXT:    v_not_b32_e32 v16, v20
-; GFX11-NEXT:    v_cndmask_b32_e32 v10, 0, v0, vcc_lo
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v24
-; GFX11-NEXT:    v_or_b32_e32 v0, v21, v3
-; GFX11-NEXT:    v_or_b32_e32 v1, v22, v8
+; GFX11-NEXT:    v_cndmask_b32_e32 v3, v10, v9, vcc_lo
 ; GFX11-NEXT:    v_lshrrev_b64 v[8:9], 1, v[12:13]
-; GFX11-NEXT:    v_sub_nc_u32_e32 v3, 64, v24
-; GFX11-NEXT:    v_and_b32_e32 v22, 0x7f, v16
-; GFX11-NEXT:    v_or_b32_e32 v2, v2, v10
-; GFX11-NEXT:    v_lshlrev_b64 v[12:13], v24, v[6:7]
-; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v24, v[4:5]
-; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v3, v[4:5]
+; GFX11-NEXT:    v_sub_nc_u32_e32 v2, 64, v23
+; GFX11-NEXT:    v_lshlrev_b64 v[12:13], v23, v[4:5]
+; GFX11-NEXT:    v_and_b32_e32 v20, 0x7f, v16
+; GFX11-NEXT:    v_or_b32_e32 v1, v17, v3
+; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v23, v[6:7]
+; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v2, v[4:5]
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v23
+; GFX11-NEXT:    v_add_nc_u32_e32 v17, 0xffffffc0, v23
 ; GFX11-NEXT:    v_lshl_or_b32 v9, v14, 31, v9
 ; GFX11-NEXT:    v_lshrrev_b64 v[14:15], 1, v[14:15]
-; GFX11-NEXT:    v_sub_nc_u32_e32 v20, 64, v22
-; GFX11-NEXT:    v_add_nc_u32_e32 v3, 0xffffffc0, v24
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v22
-; GFX11-NEXT:    v_or_b32_e32 v12, v10, v12
-; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v22
-; GFX11-NEXT:    v_lshrrev_b64 v[18:19], v22, v[8:9]
-; GFX11-NEXT:    v_lshlrev_b64 v[20:21], v20, v[14:15]
-; GFX11-NEXT:    v_lshlrev_b64 v[3:4], v3, v[4:5]
-; GFX11-NEXT:    v_or_b32_e32 v5, v11, v13
-; GFX11-NEXT:    v_cndmask_b32_e32 v13, 0, v16, vcc_lo
-; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v10, v[14:15]
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v22
-; GFX11-NEXT:    v_or_b32_e32 v16, v18, v20
-; GFX11-NEXT:    v_or_b32_e32 v18, v19, v21
-; GFX11-NEXT:    v_dual_cndmask_b32 v12, v3, v12 :: v_dual_cndmask_b32 v5, v4, v5
-; GFX11-NEXT:    v_lshrrev_b64 v[3:4], v22, v[14:15]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-NEXT:    v_cndmask_b32_e64 v10, v10, v16, s0
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v24
-; GFX11-NEXT:    v_cndmask_b32_e64 v11, v11, v18, s0
-; GFX11-NEXT:    v_cndmask_b32_e32 v14, 0, v17, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, v12, v6, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v5, v10, v8, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, v11, v9, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v9, 0, v3, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v10, 0, v4, s0
-; GFX11-NEXT:    v_or_b32_e32 v3, v23, v25
-; GFX11-NEXT:    v_or_b32_e32 v4, v13, v5
-; GFX11-NEXT:    v_or_b32_e32 v5, v14, v8
+; GFX11-NEXT:    v_sub_nc_u32_e32 v18, 64, v20
+; GFX11-NEXT:    v_cndmask_b32_e32 v12, 0, v12, vcc_lo
+; GFX11-NEXT:    v_lshlrev_b64 v[4:5], v17, v[4:5]
+; GFX11-NEXT:    v_or_b32_e32 v10, v2, v10
+; GFX11-NEXT:    v_add_nc_u32_e32 v26, 0xffffffc0, v20
+; GFX11-NEXT:    v_lshrrev_b64 v[16:17], v20, v[8:9]
+; GFX11-NEXT:    v_lshlrev_b64 v[18:19], v18, v[14:15]
+; GFX11-NEXT:    v_or_b32_e32 v2, v21, v24
+; GFX11-NEXT:    v_or_b32_e32 v11, v3, v11
+; GFX11-NEXT:    v_cndmask_b32_e32 v21, v4, v10, vcc_lo
+; GFX11-NEXT:    v_lshrrev_b64 v[3:4], v26, v[14:15]
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v20
+; GFX11-NEXT:    v_or_b32_e32 v10, v16, v18
+; GFX11-NEXT:    v_or_b32_e32 v16, v17, v19
+; GFX11-NEXT:    v_cndmask_b32_e32 v5, v5, v11, vcc_lo
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v20
+; GFX11-NEXT:    v_cndmask_b32_e32 v13, 0, v13, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v10, s1
+; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v20, v[14:15]
+; GFX11-NEXT:    v_cndmask_b32_e64 v4, v4, v16, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v21, v6, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v5, v3, v8, s2
+; GFX11-NEXT:    v_or_b32_e32 v3, v22, v25
+; GFX11-NEXT:    v_cndmask_b32_e64 v8, v4, v9, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v9, 0, v10, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v10, 0, v11, s1
+; GFX11-NEXT:    v_or_b32_e32 v4, v12, v5
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v5, v13, v8
 ; GFX11-NEXT:    v_or_b32_e32 v6, v6, v9
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-NEXT:    v_or_b32_e32 v7, v7, v10
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %result = call <2 x i128> @llvm.fshl.v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %amt)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll
index 2e8c918e4c67e..a22a858dc9f2e 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll
@@ -6019,46 +6019,46 @@ define i128 @v_fshr_i128(i128 %lhs, i128 %rhs, i128 %amt) {
 ; GFX10-NEXT:    v_not_b32_e32 v9, v8
 ; GFX10-NEXT:    v_lshlrev_b64 v[2:3], 1, v[2:3]
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v10, 31, v1
-; GFX10-NEXT:    v_and_b32_e32 v19, 0x7f, v8
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], 1, v[0:1]
-; GFX10-NEXT:    v_and_b32_e32 v18, 0x7f, v9
+; GFX10-NEXT:    v_and_b32_e32 v21, 0x7f, v8
+; GFX10-NEXT:    v_and_b32_e32 v20, 0x7f, v9
 ; GFX10-NEXT:    v_or_b32_e32 v2, v2, v10
-; GFX10-NEXT:    v_sub_nc_u32_e32 v16, 64, v19
-; GFX10-NEXT:    v_add_nc_u32_e32 v21, 0xffffffc0, v19
-; GFX10-NEXT:    v_sub_nc_u32_e32 v10, 64, v18
-; GFX10-NEXT:    v_add_nc_u32_e32 v20, 0xffffffc0, v18
-; GFX10-NEXT:    v_lshlrev_b64 v[8:9], v18, v[2:3]
-; GFX10-NEXT:    v_lshrrev_b64 v[12:13], v19, v[4:5]
+; GFX10-NEXT:    v_sub_nc_u32_e32 v16, 64, v21
+; GFX10-NEXT:    v_sub_nc_u32_e32 v12, 64, v20
+; GFX10-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v20
+; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v20, v[2:3]
+; GFX10-NEXT:    v_lshlrev_b64 v[8:9], v20, v[0:1]
+; GFX10-NEXT:    v_add_nc_u32_e32 v18, 0xffffffc0, v21
+; GFX10-NEXT:    v_lshrrev_b64 v[12:13], v12, v[0:1]
+; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v14, v[0:1]
+; GFX10-NEXT:    v_lshrrev_b64 v[14:15], v21, v[4:5]
 ; GFX10-NEXT:    v_lshlrev_b64 v[16:17], v16, v[6:7]
-; GFX10-NEXT:    v_lshrrev_b64 v[10:11], v10, v[0:1]
-; GFX10-NEXT:    v_lshlrev_b64 v[14:15], v18, v[0:1]
-; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v20, v[0:1]
-; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v18
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s4, 64, v19
-; GFX10-NEXT:    v_or_b32_e32 v12, v12, v16
-; GFX10-NEXT:    v_or_b32_e32 v10, v10, v8
-; GFX10-NEXT:    v_or_b32_e32 v11, v11, v9
-; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v21, v[6:7]
-; GFX10-NEXT:    v_or_b32_e32 v13, v13, v17
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 0, v19
+; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v20
+; GFX10-NEXT:    v_lshrrev_b64 v[18:19], v18, v[6:7]
+; GFX10-NEXT:    v_or_b32_e32 v10, v12, v10
+; GFX10-NEXT:    v_or_b32_e32 v11, v13, v11
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s5, 64, v21
+; GFX10-NEXT:    v_or_b32_e32 v12, v15, v17
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v20
 ; GFX10-NEXT:    v_cndmask_b32_e32 v10, v0, v10, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v0, v14, v16
 ; GFX10-NEXT:    v_cndmask_b32_e32 v11, v1, v11, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v19, v[6:7]
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v8, v12, s4
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v18
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v9, v13, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, 0, v14, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, 0, v15, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, v8, v4, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v6, v5, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, 0, v0, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, 0, v1, s4
-; GFX10-NEXT:    v_or_b32_e32 v0, v14, v4
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v21
+; GFX10-NEXT:    v_cndmask_b32_e32 v8, 0, v8, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v13, v18, v0, s5
+; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v21, v[6:7]
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v19, v12, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v7, 0, v9, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v13, v4, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v6, v5, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, 0, v0, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v9, 0, v1, s5
+; GFX10-NEXT:    v_or_b32_e32 v0, v8, v4
 ; GFX10-NEXT:    v_or_b32_e32 v1, v7, v5
 ; GFX10-NEXT:    v_or_b32_e32 v2, v2, v6
-; GFX10-NEXT:    v_or_b32_e32 v3, v3, v8
+; GFX10-NEXT:    v_or_b32_e32 v3, v3, v9
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_fshr_i128:
@@ -6069,49 +6069,54 @@ define i128 @v_fshr_i128(i128 %lhs, i128 %rhs, i128 %amt) {
 ; GFX11-NEXT:    v_lshrrev_b32_e32 v10, 31, v1
 ; GFX11-NEXT:    v_lshlrev_b64 v[0:1], 1, v[0:1]
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-NEXT:    v_and_b32_e32 v18, 0x7f, v9
+; GFX11-NEXT:    v_and_b32_e32 v20, 0x7f, v9
 ; GFX11-NEXT:    v_or_b32_e32 v2, v2, v10
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_sub_nc_u32_e32 v12, 64, v20
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v20
+; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v20, v[2:3]
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_lshrrev_b64 v[12:13], v12, v[0:1]
+; GFX11-NEXT:    v_or_b32_e32 v10, v12, v10
+; GFX11-NEXT:    v_and_b32_e32 v21, 0x7f, v8
+; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v20, v[0:1]
+; GFX11-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v20
+; GFX11-NEXT:    v_or_b32_e32 v11, v13, v11
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_sub_nc_u32_e32 v10, 64, v18
-; GFX11-NEXT:    v_lshlrev_b64 v[14:15], v18, v[0:1]
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v18
-; GFX11-NEXT:    v_and_b32_e32 v19, 0x7f, v8
-; GFX11-NEXT:    v_add_nc_u32_e32 v20, 0xffffffc0, v18
-; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v18, v[2:3]
-; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v10, v[0:1]
-; GFX11-NEXT:    v_cndmask_b32_e32 v14, 0, v14, vcc_lo
-; GFX11-NEXT:    v_sub_nc_u32_e32 v16, 64, v19
-; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v20, v[0:1]
-; GFX11-NEXT:    v_lshrrev_b64 v[12:13], v19, v[4:5]
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v19
-; GFX11-NEXT:    v_or_b32_e32 v10, v10, v8
-; GFX11-NEXT:    v_add_nc_u32_e32 v21, 0xffffffc0, v19
-; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v16, v[6:7]
-; GFX11-NEXT:    v_or_b32_e32 v11, v11, v9
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v19
+; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v14, v[0:1]
+; GFX11-NEXT:    v_cndmask_b32_e32 v8, 0, v8, vcc_lo
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-NEXT:    v_cndmask_b32_e32 v10, v0, v10, vcc_lo
-; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v21, v[6:7]
-; GFX11-NEXT:    v_or_b32_e32 v12, v12, v16
-; GFX11-NEXT:    v_or_b32_e32 v13, v13, v17
+; GFX11-NEXT:    v_sub_nc_u32_e32 v16, 64, v21
+; GFX11-NEXT:    v_add_nc_u32_e32 v18, 0xffffffc0, v21
+; GFX11-NEXT:    v_lshrrev_b64 v[14:15], v21, v[4:5]
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v21
 ; GFX11-NEXT:    v_cndmask_b32_e32 v11, v1, v11, vcc_lo
-; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v19, v[6:7]
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v18
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, v8, v12, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, v9, v13, s0
-; GFX11-NEXT:    v_cndmask_b32_e32 v7, 0, v15, vcc_lo
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v4, v8, v4, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v5, v6, v5, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, 0, v0, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, 0, v1, s0
+; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v16, v[6:7]
+; GFX11-NEXT:    v_lshrrev_b64 v[18:19], v18, v[6:7]
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v20
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v21
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v0, v14, v16
+; GFX11-NEXT:    v_or_b32_e32 v12, v15, v17
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v13, v18, v0, s1
+; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v21, v[6:7]
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v19, v12, s1
+; GFX11-NEXT:    v_cndmask_b32_e32 v7, 0, v9, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v4, v13, v4, s2
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v5, v6, v5, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, 0, v0, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v9, 0, v1, s1
+; GFX11-NEXT:    v_or_b32_e32 v0, v8, v4
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_or_b32_e32 v0, v14, v4
 ; GFX11-NEXT:    v_or_b32_e32 v1, v7, v5
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-NEXT:    v_or_b32_e32 v2, v2, v6
-; GFX11-NEXT:    v_or_b32_e32 v3, v3, v8
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v3, v3, v9
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %result = call i128 @llvm.fshr.i128(i128 %lhs, i128 %rhs, i128 %amt)
   ret i128 %result
@@ -6279,100 +6284,99 @@ define amdgpu_ps <4 x float> @v_fshr_i128_ssv(i128 inreg %lhs, i128 inreg %rhs,
 ; GFX10-LABEL: v_fshr_i128_ssv:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    v_not_b32_e32 v1, v0
-; GFX10-NEXT:    v_and_b32_e32 v13, 0x7f, v0
 ; GFX10-NEXT:    s_mov_b32 s9, 0
 ; GFX10-NEXT:    s_lshl_b64 s[2:3], s[2:3], 1
 ; GFX10-NEXT:    s_lshr_b32 s8, s1, 31
+; GFX10-NEXT:    v_and_b32_e32 v13, 0x7f, v0
 ; GFX10-NEXT:    v_and_b32_e32 v12, 0x7f, v1
-; GFX10-NEXT:    v_sub_nc_u32_e32 v8, 64, v13
-; GFX10-NEXT:    s_lshl_b64 s[0:1], s[0:1], 1
+; GFX10-NEXT:    s_lshl_b64 s[10:11], s[0:1], 1
 ; GFX10-NEXT:    s_or_b64 s[8:9], s[2:3], s[8:9]
-; GFX10-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v13
+; GFX10-NEXT:    v_sub_nc_u32_e32 v10, 64, v13
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, 64, v12
+; GFX10-NEXT:    v_add_nc_u32_e32 v6, 0xffffffc0, v12
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v12, s[8:9]
-; GFX10-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v12
-; GFX10-NEXT:    v_lshrrev_b64 v[4:5], v13, s[4:5]
-; GFX10-NEXT:    v_lshlrev_b64 v[8:9], v8, s[6:7]
-; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v2, s[0:1]
+; GFX10-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v13
 ; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v12
-; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v10, s[0:1]
-; GFX10-NEXT:    v_lshlrev_b64 v[6:7], v12, s[0:1]
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s0, 64, v13
-; GFX10-NEXT:    v_or_b32_e32 v4, v4, v8
-; GFX10-NEXT:    v_or_b32_e32 v2, v2, v0
-; GFX10-NEXT:    v_or_b32_e32 v3, v3, v1
+; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v2, s[10:11]
+; GFX10-NEXT:    v_lshlrev_b64 v[6:7], v6, s[10:11]
+; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v13, s[4:5]
+; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v10, s[6:7]
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s1, 64, v13
+; GFX10-NEXT:    v_lshlrev_b64 v[4:5], v12, s[10:11]
+; GFX10-NEXT:    v_or_b32_e32 v0, v2, v0
+; GFX10-NEXT:    v_or_b32_e32 v2, v3, v1
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s0, 0, v12
+; GFX10-NEXT:    v_or_b32_e32 v3, v8, v10
+; GFX10-NEXT:    v_or_b32_e32 v8, v9, v11
+; GFX10-NEXT:    v_cndmask_b32_e32 v6, v6, v0, vcc_lo
 ; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v14, s[6:7]
-; GFX10-NEXT:    v_or_b32_e32 v5, v5, v9
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s1, 0, v13
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v10, v2, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v11, v3, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v7, v7, v2, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s2, 0, v13
+; GFX10-NEXT:    v_cndmask_b32_e32 v4, 0, v4, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v5, 0, v5, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, s8, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v3, s1
 ; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v13, s[6:7]
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s2, 0, v12
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v5, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, 0, v6, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, 0, v7, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s1
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v8, s8, s2
-; GFX10-NEXT:    v_cndmask_b32_e64 v7, v10, s9, s2
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s1
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s0
-; GFX10-NEXT:    v_or_b32_e32 v0, v6, v0
-; GFX10-NEXT:    v_or_b32_e32 v1, v4, v1
-; GFX10-NEXT:    v_or_b32_e32 v2, v5, v2
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v8, s1
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, s9, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s2
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s2
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s1
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s1
+; GFX10-NEXT:    v_or_b32_e32 v0, v4, v0
+; GFX10-NEXT:    v_or_b32_e32 v1, v5, v1
+; GFX10-NEXT:    v_or_b32_e32 v2, v6, v2
 ; GFX10-NEXT:    v_or_b32_e32 v3, v7, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: v_fshr_i128_ssv:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    v_not_b32_e32 v1, v0
-; GFX11-NEXT:    s_lshr_b32 s8, s1, 31
-; GFX11-NEXT:    s_lshl_b64 s[0:1], s[0:1], 1
 ; GFX11-NEXT:    s_mov_b32 s9, 0
 ; GFX11-NEXT:    s_lshl_b64 s[2:3], s[2:3], 1
+; GFX11-NEXT:    s_lshr_b32 s8, s1, 31
+; GFX11-NEXT:    v_and_b32_e32 v13, 0x7f, v0
 ; GFX11-NEXT:    v_and_b32_e32 v12, 0x7f, v1
+; GFX11-NEXT:    s_lshl_b64 s[10:11], s[0:1], 1
 ; GFX11-NEXT:    s_or_b64 s[8:9], s[2:3], s[8:9]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_lshlrev_b64 v[6:7], v12, s[0:1]
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v12
-; GFX11-NEXT:    v_and_b32_e32 v13, 0x7f, v0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_sub_nc_u32_e32 v10, 64, v13
 ; GFX11-NEXT:    v_sub_nc_u32_e32 v2, 64, v12
 ; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v12, s[8:9]
-; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v12
-; GFX11-NEXT:    v_cndmask_b32_e32 v6, 0, v6, vcc_lo
-; GFX11-NEXT:    v_sub_nc_u32_e32 v8, 64, v13
-; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v2, s[0:1]
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v12
+; GFX11-NEXT:    v_add_nc_u32_e32 v6, 0xffffffc0, v12
 ; GFX11-NEXT:    v_add_nc_u32_e32 v14, 0xffffffc0, v13
-; GFX11-NEXT:    v_lshrrev_b64 v[4:5], v13, s[4:5]
-; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v10, s[0:1]
-; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v8, s[6:7]
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v13
-; GFX11-NEXT:    v_or_b32_e32 v2, v2, v0
-; GFX11-NEXT:    v_or_b32_e32 v3, v3, v1
+; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v2, s[10:11]
+; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v13, s[4:5]
+; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v10, s[6:7]
+; GFX11-NEXT:    v_lshlrev_b64 v[6:7], v6, s[10:11]
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v13
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v13
+; GFX11-NEXT:    v_or_b32_e32 v0, v2, v0
+; GFX11-NEXT:    v_or_b32_e32 v2, v3, v1
+; GFX11-NEXT:    v_or_b32_e32 v3, v8, v10
+; GFX11-NEXT:    v_or_b32_e32 v8, v9, v11
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e32 v6, v6, v0, vcc_lo
 ; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v14, s[6:7]
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v13
-; GFX11-NEXT:    v_or_b32_e32 v4, v4, v8
-; GFX11-NEXT:    v_or_b32_e32 v5, v5, v9
-; GFX11-NEXT:    v_cndmask_b32_e32 v8, v10, v2, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e32 v10, v11, v3, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e32 v7, v7, v2, vcc_lo
+; GFX11-NEXT:    v_lshlrev_b64 v[4:5], v12, s[10:11]
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v12
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v3, s1
 ; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v13, s[6:7]
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v12
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, v5, s0
-; GFX11-NEXT:    v_cndmask_b32_e32 v4, 0, v7, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, v8, s1
+; GFX11-NEXT:    v_dual_cndmask_b32 v4, 0, v4 :: v_dual_cndmask_b32 v5, 0, v5
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, s8, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v7, v7, s9, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s1
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v5, v8, s8, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v7, v10, s9, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v3, 0, v3, s0
-; GFX11-NEXT:    v_or_b32_e32 v0, v6, v0
+; GFX11-NEXT:    v_or_b32_e32 v0, v4, v0
+; GFX11-NEXT:    v_or_b32_e32 v1, v5, v1
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_or_b32_e32 v1, v4, v1
-; GFX11-NEXT:    v_or_b32_e32 v2, v5, v2
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-NEXT:    v_or_b32_e32 v2, v6, v2
 ; GFX11-NEXT:    v_or_b32_e32 v3, v7, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.fshr.i128(i128 %lhs, i128 %rhs, i128 %amt)
@@ -6824,44 +6828,44 @@ define amdgpu_ps <4 x float> @v_fshr_i128_vss(i128 %lhs, i128 inreg %rhs, i128 i
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v4, 31, v1
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], 1, v[0:1]
 ; GFX10-NEXT:    s_andn2_b32 s5, 0x7f, s4
-; GFX10-NEXT:    s_sub_i32 s6, s5, 64
-; GFX10-NEXT:    v_or_b32_e32 v2, v2, v4
 ; GFX10-NEXT:    s_sub_i32 s7, 64, s5
+; GFX10-NEXT:    v_or_b32_e32 v2, v2, v4
+; GFX10-NEXT:    s_sub_i32 s6, s5, 64
 ; GFX10-NEXT:    s_cmp_lt_u32 s5, 64
 ; GFX10-NEXT:    v_lshrrev_b64 v[4:5], s7, v[0:1]
 ; GFX10-NEXT:    s_cselect_b32 s8, 1, 0
-; GFX10-NEXT:    s_cmp_eq_u32 s5, 0
 ; GFX10-NEXT:    v_lshlrev_b64 v[6:7], s5, v[2:3]
-; GFX10-NEXT:    s_cselect_b32 s9, 1, 0
+; GFX10-NEXT:    s_cmp_eq_u32 s5, 0
 ; GFX10-NEXT:    v_lshlrev_b64 v[8:9], s5, v[0:1]
+; GFX10-NEXT:    s_cselect_b32 s9, 1, 0
 ; GFX10-NEXT:    s_and_b32 s5, 1, s8
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], s6, v[0:1]
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
-; GFX10-NEXT:    s_and_b32 s5, s4, 0x7f
 ; GFX10-NEXT:    v_or_b32_e32 v4, v4, v6
 ; GFX10-NEXT:    v_or_b32_e32 v5, v5, v7
-; GFX10-NEXT:    s_and_b32 s6, 1, s9
-; GFX10-NEXT:    s_sub_i32 s10, s5, 64
-; GFX10-NEXT:    s_sub_i32 s8, 64, s5
-; GFX10-NEXT:    s_cmp_lt_u32 s5, 64
+; GFX10-NEXT:    s_and_b32 s5, 1, s9
+; GFX10-NEXT:    s_and_b32 s6, s4, 0x7f
 ; GFX10-NEXT:    v_cndmask_b32_e32 v6, 0, v8, vcc_lo
-; GFX10-NEXT:    s_cselect_b32 s11, 1, 0
-; GFX10-NEXT:    s_cmp_eq_u32 s5, 0
 ; GFX10-NEXT:    v_cndmask_b32_e32 v7, 0, v9, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc_lo
-; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s6
+; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
+; GFX10-NEXT:    s_sub_i32 s10, s6, 64
+; GFX10-NEXT:    s_sub_i32 s5, 64, s6
+; GFX10-NEXT:    s_cmp_lt_u32 s6, 64
+; GFX10-NEXT:    s_cselect_b32 s11, 1, 0
+; GFX10-NEXT:    s_cmp_eq_u32 s6, 0
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v0, v2, vcc_lo
 ; GFX10-NEXT:    s_cselect_b32 s12, 1, 0
 ; GFX10-NEXT:    s_lshr_b64 s[6:7], s[0:1], s4
-; GFX10-NEXT:    s_lshl_b64 s[8:9], s[2:3], s8
+; GFX10-NEXT:    s_lshl_b64 s[8:9], s[2:3], s5
 ; GFX10-NEXT:    s_lshr_b64 s[4:5], s[2:3], s4
 ; GFX10-NEXT:    s_or_b64 s[6:7], s[6:7], s[8:9]
 ; GFX10-NEXT:    s_lshr_b64 s[2:3], s[2:3], s10
 ; GFX10-NEXT:    s_cmp_lg_u32 s11, 0
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v0, v2, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v1, v3, vcc_lo
 ; GFX10-NEXT:    s_cselect_b64 s[2:3], s[6:7], s[2:3]
 ; GFX10-NEXT:    s_cmp_lg_u32 s12, 0
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v1, v3, vcc_lo
 ; GFX10-NEXT:    s_cselect_b64 s[0:1], s[0:1], s[2:3]
 ; GFX10-NEXT:    s_cmp_lg_u32 s11, 0
 ; GFX10-NEXT:    v_or_b32_e32 v0, s0, v6
@@ -6878,39 +6882,40 @@ define amdgpu_ps <4 x float> @v_fshr_i128_vss(i128 %lhs, i128 inreg %rhs, i128 i
 ; GFX11-NEXT:    v_lshlrev_b64 v[0:1], 1, v[0:1]
 ; GFX11-NEXT:    s_and_not1_b32 s5, 0x7f, s4
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    s_sub_i32 s6, s5, 64
-; GFX11-NEXT:    v_or_b32_e32 v2, v2, v4
 ; GFX11-NEXT:    s_sub_i32 s7, 64, s5
+; GFX11-NEXT:    v_or_b32_e32 v2, v2, v4
+; GFX11-NEXT:    s_sub_i32 s6, s5, 64
 ; GFX11-NEXT:    s_cmp_lt_u32 s5, 64
 ; GFX11-NEXT:    v_lshrrev_b64 v[4:5], s7, v[0:1]
 ; GFX11-NEXT:    s_cselect_b32 s8, 1, 0
-; GFX11-NEXT:    s_cmp_eq_u32 s5, 0
 ; GFX11-NEXT:    v_lshlrev_b64 v[6:7], s5, v[2:3]
-; GFX11-NEXT:    s_cselect_b32 s9, 1, 0
+; GFX11-NEXT:    s_cmp_eq_u32 s5, 0
 ; GFX11-NEXT:    v_lshlrev_b64 v[8:9], s5, v[0:1]
+; GFX11-NEXT:    s_cselect_b32 s9, 1, 0
 ; GFX11-NEXT:    s_and_b32 s5, 1, s8
 ; GFX11-NEXT:    v_lshlrev_b64 v[0:1], s6, v[0:1]
 ; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
-; GFX11-NEXT:    s_and_b32 s5, s4, 0x7f
 ; GFX11-NEXT:    v_or_b32_e32 v4, v4, v6
 ; GFX11-NEXT:    v_or_b32_e32 v5, v5, v7
-; GFX11-NEXT:    s_and_b32 s6, 1, s9
-; GFX11-NEXT:    s_sub_i32 s10, s5, 64
-; GFX11-NEXT:    s_sub_i32 s8, 64, s5
-; GFX11-NEXT:    s_cmp_lt_u32 s5, 64
+; GFX11-NEXT:    s_and_b32 s5, 1, s9
+; GFX11-NEXT:    s_and_b32 s6, s4, 0x7f
 ; GFX11-NEXT:    v_dual_cndmask_b32 v6, 0, v8 :: v_dual_cndmask_b32 v7, 0, v9
-; GFX11-NEXT:    s_cselect_b32 s11, 1, 0
-; GFX11-NEXT:    s_cmp_eq_u32 s5, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-NEXT:    v_dual_cndmask_b32 v0, v0, v4 :: v_dual_cndmask_b32 v1, v1, v5
-; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s6
+; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
+; GFX11-NEXT:    s_sub_i32 s10, s6, 64
+; GFX11-NEXT:    s_sub_i32 s5, 64, s6
+; GFX11-NEXT:    s_cmp_lt_u32 s6, 64
+; GFX11-NEXT:    s_cselect_b32 s11, 1, 0
+; GFX11-NEXT:    s_cmp_eq_u32 s6, 0
+; GFX11-NEXT:    v_dual_cndmask_b32 v2, v0, v2 :: v_dual_cndmask_b32 v3, v1, v3
 ; GFX11-NEXT:    s_cselect_b32 s12, 1, 0
 ; GFX11-NEXT:    s_lshr_b64 s[6:7], s[0:1], s4
-; GFX11-NEXT:    s_lshl_b64 s[8:9], s[2:3], s8
+; GFX11-NEXT:    s_lshl_b64 s[8:9], s[2:3], s5
 ; GFX11-NEXT:    s_lshr_b64 s[4:5], s[2:3], s4
 ; GFX11-NEXT:    s_or_b64 s[6:7], s[6:7], s[8:9]
 ; GFX11-NEXT:    s_lshr_b64 s[2:3], s[2:3], s10
 ; GFX11-NEXT:    s_cmp_lg_u32 s11, 0
-; GFX11-NEXT:    v_dual_cndmask_b32 v2, v0, v2 :: v_dual_cndmask_b32 v3, v1, v3
 ; GFX11-NEXT:    s_cselect_b64 s[2:3], s[6:7], s[2:3]
 ; GFX11-NEXT:    s_cmp_lg_u32 s12, 0
 ; GFX11-NEXT:    s_cselect_b64 s[0:1], s[0:1], s[2:3]
@@ -7789,7 +7794,6 @@ define <2 x i128> @v_fshr_v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %a
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v17, 31, v1
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], 1, v[0:1]
 ; GFX10-NEXT:    v_add_nc_u32_e32 v27, 0xffffffc0, v26
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s4, 64, v26
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v18, 64, v25
 ; GFX10-NEXT:    v_or_b32_e32 v2, v2, v17
 ; GFX10-NEXT:    v_add_nc_u32_e32 v19, 0xffffffc0, v25
@@ -7798,6 +7802,7 @@ define <2 x i128> @v_fshr_v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %a
 ; GFX10-NEXT:    v_lshrrev_b64 v[17:18], v18, v[0:1]
 ; GFX10-NEXT:    v_lshlrev_b64 v[21:22], v25, v[2:3]
 ; GFX10-NEXT:    v_lshlrev_b64 v[0:1], v19, v[0:1]
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v25
 ; GFX10-NEXT:    v_cndmask_b32_e32 v23, 0, v23, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v24, 0, v24, vcc_lo
 ; GFX10-NEXT:    v_or_b32_e32 v22, v18, v22
@@ -7808,65 +7813,65 @@ define <2 x i128> @v_fshr_v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %a
 ; GFX10-NEXT:    v_lshlrev_b64 v[18:19], v18, v[10:11]
 ; GFX10-NEXT:    v_cndmask_b32_e32 v21, v0, v21, vcc_lo
 ; GFX10-NEXT:    v_lshrrev_b64 v[0:1], v27, v[10:11]
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v25
+; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v26
+; GFX10-NEXT:    v_cndmask_b32_e64 v22, v22, v3, s4
 ; GFX10-NEXT:    v_or_b32_e32 v16, v16, v18
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v21, v2, s4
 ; GFX10-NEXT:    v_or_b32_e32 v17, v17, v19
-; GFX10-NEXT:    v_cndmask_b32_e32 v18, v21, v2, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v22, v22, v3, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v26
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v16, s4
-; GFX10-NEXT:    v_not_b32_e32 v16, v20
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v17, s4
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v26
 ; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v26, v[10:11]
-; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v8, vcc_lo
-; GFX10-NEXT:    v_and_b32_e32 v25, 0x7f, v16
-; GFX10-NEXT:    v_lshrrev_b32_e32 v8, 31, v5
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v16, vcc_lo
+; GFX10-NEXT:    v_not_b32_e32 v16, v20
+; GFX10-NEXT:    v_lshrrev_b32_e32 v10, 31, v5
 ; GFX10-NEXT:    v_lshlrev_b64 v[4:5], 1, v[4:5]
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v9, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v17, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v8, s4
+; GFX10-NEXT:    v_and_b32_e32 v25, 0x7f, v16
+; GFX10-NEXT:    v_or_b32_e32 v6, v6, v10
+; GFX10-NEXT:    v_and_b32_e32 v20, 0x7f, v20
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v9, s4
+; GFX10-NEXT:    v_cndmask_b32_e32 v26, 0, v2, vcc_lo
+; GFX10-NEXT:    v_sub_nc_u32_e32 v8, 64, v25
+; GFX10-NEXT:    v_cndmask_b32_e32 v27, 0, v3, vcc_lo
+; GFX10-NEXT:    v_add_nc_u32_e32 v16, 0xffffffc0, v25
+; GFX10-NEXT:    v_sub_nc_u32_e32 v18, 64, v20
+; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v25, v[4:5]
+; GFX10-NEXT:    v_lshrrev_b64 v[2:3], v8, v[4:5]
+; GFX10-NEXT:    v_lshlrev_b64 v[8:9], v25, v[6:7]
+; GFX10-NEXT:    v_lshlrev_b64 v[4:5], v16, v[4:5]
 ; GFX10-NEXT:    v_or_b32_e32 v0, v23, v0
-; GFX10-NEXT:    v_sub_nc_u32_e32 v9, 64, v25
-; GFX10-NEXT:    v_or_b32_e32 v6, v6, v8
-; GFX10-NEXT:    v_and_b32_e32 v23, 0x7f, v20
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v26, 0, v3, s4
-; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v9, v[4:5]
-; GFX10-NEXT:    v_lshlrev_b64 v[10:11], v25, v[6:7]
-; GFX10-NEXT:    v_sub_nc_u32_e32 v20, 64, v23
-; GFX10-NEXT:    v_add_nc_u32_e32 v3, 0xffffffc0, v25
-; GFX10-NEXT:    v_or_b32_e32 v2, v18, v2
-; GFX10-NEXT:    v_lshlrev_b64 v[16:17], v25, v[4:5]
-; GFX10-NEXT:    v_lshrrev_b64 v[18:19], v23, v[12:13]
-; GFX10-NEXT:    v_or_b32_e32 v10, v8, v10
-; GFX10-NEXT:    v_add_nc_u32_e32 v8, 0xffffffc0, v23
-; GFX10-NEXT:    v_lshlrev_b64 v[20:21], v20, v[14:15]
 ; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v25
-; GFX10-NEXT:    v_lshlrev_b64 v[3:4], v3, v[4:5]
-; GFX10-NEXT:    v_or_b32_e32 v5, v9, v11
-; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v8, v[14:15]
-; GFX10-NEXT:    v_cmp_gt_u32_e64 s4, 64, v23
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, 0, v16, vcc_lo
-; GFX10-NEXT:    v_or_b32_e32 v16, v18, v20
-; GFX10-NEXT:    v_or_b32_e32 v18, v19, v21
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v3, v10, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v4, v5, vcc_lo
+; GFX10-NEXT:    v_add_nc_u32_e32 v23, 0xffffffc0, v20
+; GFX10-NEXT:    v_lshrrev_b64 v[16:17], v20, v[12:13]
+; GFX10-NEXT:    v_or_b32_e32 v8, v2, v8
+; GFX10-NEXT:    v_lshlrev_b64 v[18:19], v18, v[14:15]
+; GFX10-NEXT:    v_or_b32_e32 v2, v21, v26
+; GFX10-NEXT:    v_or_b32_e32 v9, v3, v9
+; GFX10-NEXT:    v_cmp_gt_u32_e64 s5, 64, v20
+; GFX10-NEXT:    v_cndmask_b32_e32 v21, v4, v8, vcc_lo
 ; GFX10-NEXT:    v_lshrrev_b64 v[3:4], v23, v[14:15]
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v8, v16, s4
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 0, v23
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v25
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, v9, v18, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, 0, v17, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v8, v16, v18
+; GFX10-NEXT:    v_or_b32_e32 v16, v17, v19
+; GFX10-NEXT:    v_cndmask_b32_e32 v5, v5, v9, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 0, v25
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 0, v20
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v8, s5
+; GFX10-NEXT:    v_lshrrev_b64 v[8:9], v20, v[14:15]
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v16, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v10, 0, v10, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v11, 0, v11, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v21, v6, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v3, v12, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v12, v4, v13, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, 0, v8, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v9, 0, v9, s5
 ; GFX10-NEXT:    v_or_b32_e32 v1, v24, v1
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v10, v6, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v8, v12, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v9, v13, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, 0, v3, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v10, 0, v4, s4
-; GFX10-NEXT:    v_or_b32_e32 v3, v22, v26
-; GFX10-NEXT:    v_or_b32_e32 v4, v11, v5
-; GFX10-NEXT:    v_or_b32_e32 v5, v14, v8
-; GFX10-NEXT:    v_or_b32_e32 v6, v6, v9
-; GFX10-NEXT:    v_or_b32_e32 v7, v7, v10
+; GFX10-NEXT:    v_or_b32_e32 v3, v22, v27
+; GFX10-NEXT:    v_or_b32_e32 v4, v10, v5
+; GFX10-NEXT:    v_or_b32_e32 v5, v11, v12
+; GFX10-NEXT:    v_or_b32_e32 v6, v6, v8
+; GFX10-NEXT:    v_or_b32_e32 v7, v7, v9
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_fshr_v2i128:
@@ -7879,95 +7884,93 @@ define <2 x i128> @v_fshr_v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %a
 ; GFX11-NEXT:    v_and_b32_e32 v25, 0x7f, v17
 ; GFX11-NEXT:    v_lshrrev_b32_e32 v17, 31, v1
 ; GFX11-NEXT:    v_lshlrev_b64 v[0:1], 1, v[0:1]
-; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v25
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_sub_nc_u32_e32 v18, 64, v25
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11-NEXT:    v_or_b32_e32 v2, v2, v17
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v25
 ; GFX11-NEXT:    v_lshlrev_b64 v[23:24], v25, v[0:1]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_dual_cndmask_b32 v23, 0, v23 :: v_dual_and_b32 v26, 0x7f, v16
-; GFX11-NEXT:    v_cndmask_b32_e32 v24, 0, v24, vcc_lo
-; GFX11-NEXT:    v_sub_nc_u32_e32 v18, 64, v25
-; GFX11-NEXT:    v_lshlrev_b64 v[21:22], v25, v[2:3]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v26
+; GFX11-NEXT:    v_and_b32_e32 v26, 0x7f, v16
 ; GFX11-NEXT:    v_lshrrev_b64 v[17:18], v18, v[0:1]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_lshlrev_b64 v[21:22], v25, v[2:3]
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_dual_cndmask_b32 v24, 0, v24 :: v_dual_add_nc_u32 v19, 0xffffffc0, v25
+; GFX11-NEXT:    v_cndmask_b32_e32 v23, 0, v23, vcc_lo
 ; GFX11-NEXT:    v_or_b32_e32 v22, v18, v22
-; GFX11-NEXT:    v_add_nc_u32_e32 v19, 0xffffffc0, v25
-; GFX11-NEXT:    v_or_b32_e32 v21, v17, v21
 ; GFX11-NEXT:    v_sub_nc_u32_e32 v18, 64, v26
+; GFX11-NEXT:    v_or_b32_e32 v21, v17, v21
+; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v19, v[0:1]
 ; GFX11-NEXT:    v_lshrrev_b64 v[16:17], v26, v[8:9]
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-NEXT:    v_lshlrev_b64 v[0:1], v19, v[0:1]
 ; GFX11-NEXT:    v_lshlrev_b64 v[18:19], v18, v[10:11]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_dual_cndmask_b32 v22, v1, v22 :: v_dual_cndmask_b32 v21, v0, v21
-; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v25
-; GFX11-NEXT:    v_add_nc_u32_e32 v27, 0xffffffc0, v26
+; GFX11-NEXT:    v_dual_cndmask_b32 v21, v0, v21 :: v_dual_cndmask_b32 v22, v1, v22
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v26
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_3)
 ; GFX11-NEXT:    v_or_b32_e32 v16, v16, v18
+; GFX11-NEXT:    v_add_nc_u32_e32 v27, 0xffffffc0, v26
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v25
 ; GFX11-NEXT:    v_or_b32_e32 v17, v17, v19
-; GFX11-NEXT:    v_cndmask_b32_e32 v22, v22, v3, vcc_lo
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_lshrrev_b64 v[0:1], v27, v[10:11]
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v16, s0
-; GFX11-NEXT:    v_not_b32_e32 v16, v20
-; GFX11-NEXT:    v_cndmask_b32_e32 v18, v21, v2, vcc_lo
-; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v26
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, v17, s0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-NEXT:    v_cndmask_b32_e64 v21, v21, v2, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v22, v22, v3, s0
 ; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v26, v[10:11]
-; GFX11-NEXT:    v_and_b32_e32 v25, 0x7f, v16
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_dual_cndmask_b32 v0, v0, v8 :: v_dual_cndmask_b32 v1, v1, v9
-; GFX11-NEXT:    v_lshrrev_b32_e32 v8, 31, v5
+; GFX11-NEXT:    v_lshrrev_b32_e32 v10, 31, v5
 ; GFX11-NEXT:    v_lshlrev_b64 v[4:5], 1, v[4:5]
-; GFX11-NEXT:    v_sub_nc_u32_e32 v9, 64, v25
-; GFX11-NEXT:    v_cndmask_b32_e64 v26, 0, v3, s0
-; GFX11-NEXT:    v_add_nc_u32_e32 v3, 0xffffffc0, v25
-; GFX11-NEXT:    v_or_b32_e32 v6, v6, v8
-; GFX11-NEXT:    v_or_b32_e32 v0, v23, v0
-; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v9, v[4:5]
-; GFX11-NEXT:    v_lshlrev_b64 v[16:17], v25, v[4:5]
-; GFX11-NEXT:    v_lshlrev_b64 v[3:4], v3, v[4:5]
-; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v25, v[6:7]
+; GFX11-NEXT:    v_cndmask_b32_e32 v0, v0, v16, vcc_lo
+; GFX11-NEXT:    v_not_b32_e32 v16, v20
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v26
+; GFX11-NEXT:    v_or_b32_e32 v6, v6, v10
+; GFX11-NEXT:    v_dual_cndmask_b32 v1, v1, v17 :: v_dual_and_b32 v20, 0x7f, v20
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_dual_cndmask_b32 v26, 0, v2 :: v_dual_and_b32 v25, 0x7f, v16
+; GFX11-NEXT:    v_cndmask_b32_e32 v27, 0, v3, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v8, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, v9, s0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-NEXT:    v_lshlrev_b64 v[10:11], v25, v[4:5]
 ; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 64, v25
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s0
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v25
+; GFX11-NEXT:    v_sub_nc_u32_e32 v18, 64, v20
+; GFX11-NEXT:    v_or_b32_e32 v0, v23, v0
+; GFX11-NEXT:    v_add_nc_u32_e32 v23, 0xffffffc0, v20
+; GFX11-NEXT:    v_cmp_gt_u32_e64 s1, 64, v20
+; GFX11-NEXT:    v_cndmask_b32_e32 v10, 0, v10, vcc_lo
+; GFX11-NEXT:    v_sub_nc_u32_e32 v8, 64, v25
+; GFX11-NEXT:    v_add_nc_u32_e32 v16, 0xffffffc0, v25
+; GFX11-NEXT:    v_lshlrev_b64 v[18:19], v18, v[14:15]
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 0, v25
+; GFX11-NEXT:    v_cmp_eq_u32_e64 s2, 0, v20
+; GFX11-NEXT:    v_lshrrev_b64 v[2:3], v8, v[4:5]
+; GFX11-NEXT:    v_lshlrev_b64 v[8:9], v25, v[6:7]
+; GFX11-NEXT:    v_lshlrev_b64 v[4:5], v16, v[4:5]
+; GFX11-NEXT:    v_lshrrev_b64 v[16:17], v20, v[12:13]
+; GFX11-NEXT:    v_cndmask_b32_e32 v11, 0, v11, vcc_lo
 ; GFX11-NEXT:    v_or_b32_e32 v1, v24, v1
-; GFX11-NEXT:    v_or_b32_e32 v10, v8, v10
-; GFX11-NEXT:    v_and_b32_e32 v23, 0x7f, v20
-; GFX11-NEXT:    v_or_b32_e32 v2, v18, v2
-; GFX11-NEXT:    v_or_b32_e32 v5, v9, v11
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_dual_cndmask_b32 v11, 0, v16 :: v_dual_cndmask_b32 v10, v3, v10
-; GFX11-NEXT:    v_sub_nc_u32_e32 v20, 64, v23
-; GFX11-NEXT:    v_add_nc_u32_e32 v8, 0xffffffc0, v23
-; GFX11-NEXT:    v_lshrrev_b64 v[18:19], v23, v[12:13]
-; GFX11-NEXT:    v_cmp_gt_u32_e64 s0, 64, v23
-; GFX11-NEXT:    v_cndmask_b32_e32 v5, v4, v5, vcc_lo
-; GFX11-NEXT:    v_lshlrev_b64 v[20:21], v20, v[14:15]
-; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v8, v[14:15]
+; GFX11-NEXT:    v_or_b32_e32 v8, v2, v8
+; GFX11-NEXT:    v_or_b32_e32 v2, v21, v26
+; GFX11-NEXT:    v_or_b32_e32 v9, v3, v9
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-NEXT:    v_cndmask_b32_e32 v21, v4, v8, vcc_lo
 ; GFX11-NEXT:    v_lshrrev_b64 v[3:4], v23, v[14:15]
-; GFX11-NEXT:    v_cndmask_b32_e32 v14, 0, v17, vcc_lo
-; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 0, v23
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, v10, v6, s2
-; GFX11-NEXT:    v_or_b32_e32 v16, v18, v20
-; GFX11-NEXT:    v_or_b32_e32 v18, v19, v21
-; GFX11-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s2
-; GFX11-NEXT:    v_cndmask_b32_e64 v10, 0, v4, s0
+; GFX11-NEXT:    v_or_b32_e32 v8, v16, v18
+; GFX11-NEXT:    v_or_b32_e32 v16, v17, v19
+; GFX11-NEXT:    v_cndmask_b32_e32 v5, v5, v9, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v21, v6, s0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v8, s1
+; GFX11-NEXT:    v_lshrrev_b64 v[8:9], v20, v[14:15]
+; GFX11-NEXT:    v_cndmask_b32_e64 v4, v4, v16, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v7, v5, v7, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v5, v3, v12, s2
+; GFX11-NEXT:    v_or_b32_e32 v3, v22, v27
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-NEXT:    v_cndmask_b32_e64 v12, v4, v13, s2
+; GFX11-NEXT:    v_cndmask_b32_e64 v8, 0, v8, s1
+; GFX11-NEXT:    v_cndmask_b32_e64 v9, 0, v9, s1
+; GFX11-NEXT:    v_or_b32_e32 v4, v10, v5
+; GFX11-NEXT:    v_or_b32_e32 v5, v11, v12
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, v8, v16, s0
-; GFX11-NEXT:    v_cndmask_b32_e64 v9, v9, v18, s0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-NEXT:    v_or_b32_e32 v7, v7, v10
-; GFX11-NEXT:    v_cndmask_b32_e64 v5, v8, v12, s1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e64 v8, v9, v13, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v9, 0, v3, s0
-; GFX11-NEXT:    v_or_b32_e32 v3, v22, v26
-; GFX11-NEXT:    v_or_b32_e32 v4, v11, v5
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_or_b32_e32 v5, v14, v8
-; GFX11-NEXT:    v_or_b32_e32 v6, v6, v9
+; GFX11-NEXT:    v_or_b32_e32 v6, v6, v8
+; GFX11-NEXT:    v_or_b32_e32 v7, v7, v9
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %result = call <2 x i128> @llvm.fshr.v2i128(<2 x i128> %lhs, <2 x i128> %rhs, <2 x i128> %amt)
   ret <2 x i128> %result
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll
index 4ae98ff1edf6c..2eb7486a2684d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll
@@ -261,8 +261,8 @@ define amdgpu_ps void @insertelement_s_v2i16_v_s(ptr addrspace(4) inreg %ptr, i1
 ; GFX11-NEXT:    v_and_b32_e32 v2, 0xffff, v0
 ; GFX11-NEXT:    s_lshl_b32 s1, s1, 4
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_lshl_b32 s2, 0xffff, s1
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_and_not1_b32 s0, s0, s2
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -632,8 +632,8 @@ define amdgpu_ps void @insertelement_v_v2i16_v_s(ptr addrspace(1) %ptr, i16 %val
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v2, s0, v0
 ; GFX11-NEXT:    s_lshl_b32 s0, 0xffff, s0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_not_b32 s0, s0
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    v_and_or_b32 v2, v3, s0, v2
 ; GFX11-NEXT:    global_store_b32 v[0:1], v2, off
@@ -906,8 +906,8 @@ define amdgpu_ps void @insertelement_v_v4i16_s_s(ptr addrspace(1) %ptr, i16 inre
 ; GFX11-NEXT:    v_and_or_b32 v4, v2, s2, s1
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[2:3], v[0:1], off
 ; GFX11-NEXT:    s_endpgm
   %vec = load <4 x i16>, ptr addrspace(1 ) %ptr
@@ -1440,8 +1440,8 @@ define amdgpu_ps void @insertelement_v_v4i16_s_v(ptr addrspace(1) %ptr, i16 inre
 ; GFX11-NEXT:    v_and_or_b32 v4, v4, v3, v2
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[2:3], v[0:1], off
 ; GFX11-NEXT:    s_endpgm
   %vec = load <4 x i16>, ptr addrspace(1) %ptr
@@ -1685,8 +1685,8 @@ define amdgpu_ps void @insertelement_v_v4i16_v_v(ptr addrspace(1) %ptr, i16 %val
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[2:3], v[0:1], off
 ; GFX11-NEXT:    s_endpgm
   %vec = load <4 x i16>, ptr addrspace(1) %ptr
@@ -2387,8 +2387,8 @@ define amdgpu_ps void @insertelement_s_v8i16_s_v(ptr addrspace(4) inreg %ptr, i1
 ; GFX11-NEXT:    v_and_or_b32 v7, v7, v5, v4
 ; GFX11-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v5, 0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v7, s2
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v7, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s1
 ; GFX11-NEXT:    global_store_b128 v[4:5], v[0:3], off
@@ -2573,8 +2573,8 @@ define amdgpu_ps void @insertelement_s_v8i16_v_v(ptr addrspace(4) inreg %ptr, i1
 ; GFX11-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v5, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v7, s2
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v7, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s1
 ; GFX11-NEXT:    global_store_b128 v[4:5], v[0:3], off
@@ -2726,11 +2726,12 @@ define amdgpu_ps void @insertelement_v_v8i16_s_v(ptr addrspace(1) %ptr, i16 inre
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v5, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v6, s1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-NEXT:    v_and_or_b32 v9, v2, v7, v0
 ; GFX11-NEXT:    v_mov_b32_e32 v7, 0
-; GFX11-NEXT:    v_dual_mov_b32 v8, 0 :: v_dual_cndmask_b32 v1, v4, v9
+; GFX11-NEXT:    v_mov_b32_e32 v8, 0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v3, v9, s2
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v4, v9, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v5, v9, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v6, v9, s1
 ; GFX11-NEXT:    global_store_b128 v[7:8], v[0:3], off
@@ -4128,9 +4129,9 @@ define amdgpu_ps void @insertelement_s_v16i16_v_v(ptr addrspace(4) inreg %ptr, i
 ; GFX11-NEXT:    v_mov_b32_e32 v4, s12
 ; GFX11-NEXT:    v_mov_b32_e32 v6, s14
 ; GFX11-NEXT:    v_mov_b32_e32 v8, 0
-; GFX11-NEXT:    v_mov_b32_e32 v9, 0
-; GFX11-NEXT:    v_dual_cndmask_b32 v1, v1, v13 :: v_dual_mov_b32 v10, 16
+; GFX11-NEXT:    v_dual_mov_b32 v9, 0 :: v_dual_mov_b32 v10, 16
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v13, s6
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v13, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v13, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v13, s1
 ; GFX11-NEXT:    v_mov_b32_e32 v11, 0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll
index d4b9bc6d2e3c1..706a5c6c03320 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll
@@ -1048,8 +1048,8 @@ define amdgpu_ps void @insertelement_s_v4i8_v_s(ptr addrspace(4) inreg %ptr, i8
 ; GFX11-NEXT:    v_and_b32_e32 v2, 0xff, v0
 ; GFX11-NEXT:    s_lshl_b32 s1, s1, 3
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_lshl_b32 s2, 0xff, s1
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_and_not1_b32 s0, s0, s2
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -1419,8 +1419,8 @@ define amdgpu_ps void @insertelement_v_v4i8_v_s(ptr addrspace(1) %ptr, i8 %val,
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v2, s0, v0
 ; GFX11-NEXT:    s_lshl_b32 s0, 0xff, s0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_not_b32 s0, s0
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    v_and_or_b32 v2, v3, s0, v2
 ; GFX11-NEXT:    global_store_b32 v[0:1], v2, off
@@ -1769,8 +1769,8 @@ define amdgpu_ps void @insertelement_v_v8i8_s_s(ptr addrspace(1) %ptr, i8 inreg
 ; GFX11-NEXT:    v_and_or_b32 v4, v2, s2, s1
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[2:3], v[0:1], off
 ; GFX11-NEXT:    s_endpgm
   %vec = load <8 x i8>, ptr addrspace(1 ) %ptr
@@ -2303,8 +2303,8 @@ define amdgpu_ps void @insertelement_v_v8i8_s_v(ptr addrspace(1) %ptr, i8 inreg
 ; GFX11-NEXT:    v_and_or_b32 v4, v4, v3, v2
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[2:3], v[0:1], off
 ; GFX11-NEXT:    s_endpgm
   %vec = load <8 x i8>, ptr addrspace(1) %ptr
@@ -2548,8 +2548,8 @@ define amdgpu_ps void @insertelement_v_v8i8_v_v(ptr addrspace(1) %ptr, i8 %val,
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v4, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[2:3], v[0:1], off
 ; GFX11-NEXT:    s_endpgm
   %vec = load <8 x i8>, ptr addrspace(1) %ptr
@@ -3250,8 +3250,8 @@ define amdgpu_ps void @insertelement_s_v16i8_s_v(ptr addrspace(4) inreg %ptr, i8
 ; GFX11-NEXT:    v_and_or_b32 v7, v7, v5, v4
 ; GFX11-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v5, 0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v7, s2
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v7, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s1
 ; GFX11-NEXT:    global_store_b128 v[4:5], v[0:3], off
@@ -3436,8 +3436,8 @@ define amdgpu_ps void @insertelement_s_v16i8_v_v(ptr addrspace(4) inreg %ptr, i8
 ; GFX11-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v5, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, v7, s2
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v7, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v7, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s1
 ; GFX11-NEXT:    global_store_b128 v[4:5], v[0:3], off
@@ -3589,11 +3589,12 @@ define amdgpu_ps void @insertelement_v_v16i8_s_v(ptr addrspace(1) %ptr, i8 inreg
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v5, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v6, s1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-NEXT:    v_and_or_b32 v9, v2, v7, v0
 ; GFX11-NEXT:    v_mov_b32_e32 v7, 0
-; GFX11-NEXT:    v_dual_mov_b32 v8, 0 :: v_dual_cndmask_b32 v1, v4, v9
+; GFX11-NEXT:    v_mov_b32_e32 v8, 0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v3, v9, s2
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v4, v9, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v5, v9, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v6, v9, s1
 ; GFX11-NEXT:    global_store_b128 v[7:8], v[0:3], off
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll
index 5eca04c02a9f9..8134eb3ca2afc 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll
@@ -2498,21 +2498,19 @@ define amdgpu_ps void @dyn_insertelement_v8f64_v_v_v_add_1(<8 x double> %vec, do
 ; GFX11-NEXT:    v_cmp_eq_u32_e64 s1, 7, v18
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, v16, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, v17, s0
-; GFX11-NEXT:    v_cndmask_b32_e32 v5, v5, v17, vcc_lo
 ; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 3, v18
-; GFX11-NEXT:    v_cndmask_b32_e32 v4, v4, v16, vcc_lo
+; GFX11-NEXT:    v_dual_cndmask_b32 v4, v4, v16 :: v_dual_cndmask_b32 v5, v5, v17
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 4, v18
 ; GFX11-NEXT:    v_cndmask_b32_e64 v14, v14, v16, s1
-; GFX11-NEXT:    v_cndmask_b32_e64 v15, v15, v17, s1
 ; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, v16, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v7, v7, v17, s0
-; GFX11-NEXT:    v_cndmask_b32_e32 v9, v9, v17, vcc_lo
 ; GFX11-NEXT:    v_cmp_eq_u32_e64 s0, 5, v18
-; GFX11-NEXT:    v_cndmask_b32_e32 v8, v8, v16, vcc_lo
+; GFX11-NEXT:    v_dual_cndmask_b32 v8, v8, v16 :: v_dual_cndmask_b32 v9, v9, v17
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 6, v18
+; GFX11-NEXT:    v_cndmask_b32_e64 v15, v15, v17, s1
 ; GFX11-NEXT:    v_cndmask_b32_e64 v10, v10, v16, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v11, v11, v17, s0
-; GFX11-NEXT:    v_dual_cndmask_b32 v13, v13, v17 :: v_dual_cndmask_b32 v12, v12, v16
+; GFX11-NEXT:    v_dual_cndmask_b32 v12, v12, v16 :: v_dual_cndmask_b32 v13, v13, v17
 ; GFX11-NEXT:    global_store_b128 v[0:1], v[0:3], off dlc
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    global_store_b128 v[0:1], v[4:7], off dlc
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index c862335764dd4..b241568a89c86 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -1023,12 +1023,12 @@ define amdgpu_kernel void @image_bvh64_intersect_ray_a16_nsa_reassign(ptr %p_ray
 ; GFX11-NEXT:    s_mov_b32 s10, 0x45004800
 ; GFX11-NEXT:    v_mov_b32_e32 v6, 0xb36211c6
 ; GFX11-NEXT:    v_bfrev_b32_e32 v7, 4.0
-; GFX11-NEXT:    v_mov_b32_e32 v3, s8
-; GFX11-NEXT:    v_dual_mov_b32 v5, s10 :: v_dual_mov_b32 v4, s9
+; GFX11-NEXT:    v_dual_mov_b32 v3, s8 :: v_dual_mov_b32 v4, s9
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s7
+; GFX11-NEXT:    v_dual_mov_b32 v5, s10 :: v_dual_mov_b32 v0, s6
+; GFX11-NEXT:    v_mov_b32_e32 v1, s7
 ; GFX11-NEXT:    s_mov_b32 s6, 2.0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
 ; GFX11-NEXT:    flat_load_b32 v8, v[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll
index 784611cf68dd2..f591afb5ccae1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll
@@ -1673,10 +1673,10 @@ define i65 @v_lshr_i65(i65 %value, i65 %amount) {
 ; GFX11-NEXT:    v_add_nc_u32_e32 v10, 0xffffffc0, v3
 ; GFX11-NEXT:    v_lshrrev_b64 v[10:11], v10, v[4:5]
 ; GFX11-NEXT:    v_lshrrev_b64 v[4:5], v3, v[4:5]
-; GFX11-NEXT:    v_cndmask_b32_e32 v5, v11, v6, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e32 v2, v10, v2, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v5, v1, s0
+; GFX11-NEXT:    v_cndmask_b32_e32 v5, v11, v6, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v2, v0, s0
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v5, v1, s0
 ; GFX11-NEXT:    v_cndmask_b32_e32 v2, 0, v4, vcc_lo
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %result = lshr i65 %value, %amount
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
index 455446aa38c60..88e14c3de0e9a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
@@ -2078,64 +2078,64 @@ define i256 @v_mul_i256(i256 %num, i256 %den) {
 ; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v0, v14, 0
 ; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v0, v12, 0
 ; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v1, v13, v[16:17]
-; GFX7-NEXT:    v_mul_lo_u32 v28, v4, v11
-; GFX7-NEXT:    v_mul_lo_u32 v27, v5, v10
+; GFX7-NEXT:    v_mad_u64_u32 v[20:21], s[6:7], v0, v10, 0
 ; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v2, v12, v[16:17]
-; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v3, v11, v[16:17]
-; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v4, v10, v[16:17]
 ; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v1, v11, v[18:19]
-; GFX7-NEXT:    v_cndmask_b32_e64 v20, 0, 1, s[4:5]
-; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v5, v9, v[16:17]
+; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v3, v11, v[16:17]
 ; GFX7-NEXT:    v_mad_u64_u32 v[18:19], vcc, v2, v10, v[18:19]
-; GFX7-NEXT:    v_addc_u32_e32 v20, vcc, 0, v20, vcc
+; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v4, v10, v[16:17]
+; GFX7-NEXT:    v_cndmask_b32_e64 v22, 0, 1, s[4:5]
+; GFX7-NEXT:    v_addc_u32_e32 v22, vcc, 0, v22, vcc
+; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[8:9], v5, v9, v[16:17]
 ; GFX7-NEXT:    v_mad_u64_u32 v[18:19], vcc, v3, v9, v[18:19]
-; GFX7-NEXT:    v_addc_u32_e32 v20, vcc, 0, v20, vcc
-; GFX7-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v0, v10, 0
+; GFX7-NEXT:    v_addc_u32_e32 v22, vcc, 0, v22, vcc
 ; GFX7-NEXT:    v_mad_u64_u32 v[18:19], vcc, v4, v8, v[18:19]
 ; GFX7-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v6, v8, v[16:17]
-; GFX7-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v1, v9, v[21:22]
-; GFX7-NEXT:    v_addc_u32_e32 v25, vcc, 0, v20, vcc
-; GFX7-NEXT:    v_mov_b32_e32 v20, v18
+; GFX7-NEXT:    v_mad_u64_u32 v[20:21], s[6:7], v1, v9, v[20:21]
+; GFX7-NEXT:    v_addc_u32_e32 v23, vcc, 0, v22, vcc
+; GFX7-NEXT:    v_mov_b32_e32 v22, v18
 ; GFX7-NEXT:    v_mov_b32_e32 v18, v19
 ; GFX7-NEXT:    v_mov_b32_e32 v19, v16
 ; GFX7-NEXT:    v_mad_u64_u32 v[18:19], vcc, v0, v13, v[18:19]
 ; GFX7-NEXT:    v_mul_lo_u32 v16, v6, v9
-; GFX7-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
-; GFX7-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v2, v8, v[21:22]
-; GFX7-NEXT:    v_addc_u32_e64 v26, s[4:5], 0, v6, s[4:5]
-; GFX7-NEXT:    v_mad_u64_u32 v[23:24], s[4:5], v1, v12, v[18:19]
-; GFX7-NEXT:    v_mov_b32_e32 v19, v22
-; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[12:13], v0, v11, v[19:20]
-; GFX7-NEXT:    v_mad_u64_u32 v[22:23], s[6:7], v2, v11, v[23:24]
-; GFX7-NEXT:    v_mul_lo_u32 v24, v3, v12
-; GFX7-NEXT:    v_mad_u64_u32 v[11:12], s[8:9], v3, v10, v[22:23]
-; GFX7-NEXT:    v_mul_lo_u32 v22, v2, v13
-; GFX7-NEXT:    v_mad_u64_u32 v[12:13], s[10:11], v4, v9, v[11:12]
-; GFX7-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s[12:13]
-; GFX7-NEXT:    v_mad_u64_u32 v[10:11], s[12:13], v1, v10, v[18:19]
-; GFX7-NEXT:    v_addc_u32_e64 v4, s[12:13], 0, v4, s[12:13]
-; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[12:13], v2, v9, v[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[6:7]
+; GFX7-NEXT:    v_mad_u64_u32 v[20:21], s[4:5], v2, v8, v[20:21]
+; GFX7-NEXT:    v_addc_u32_e64 v24, s[4:5], 0, v6, s[4:5]
+; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v1, v12, v[18:19]
+; GFX7-NEXT:    v_mad_u64_u32 v[21:22], s[10:11], v0, v11, v[21:22]
+; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[6:7], v2, v11, v[18:19]
+; GFX7-NEXT:    v_mul_lo_u32 v26, v4, v11
+; GFX7-NEXT:    v_mul_lo_u32 v27, v3, v12
+; GFX7-NEXT:    v_mad_u64_u32 v[11:12], s[8:9], v3, v10, v[18:19]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[10:11]
+; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[10:11], v1, v10, v[21:22]
+; GFX7-NEXT:    v_mul_lo_u32 v25, v5, v10
+; GFX7-NEXT:    v_mul_lo_u32 v28, v2, v13
+; GFX7-NEXT:    v_mad_u64_u32 v[12:13], s[12:13], v4, v9, v[11:12]
 ; GFX7-NEXT:    v_mad_u64_u32 v[10:11], s[14:15], v0, v8, 0
-; GFX7-NEXT:    v_addc_u32_e64 v2, s[12:13], 0, v4, s[12:13]
+; GFX7-NEXT:    v_addc_u32_e64 v22, s[10:11], 0, v6, s[10:11]
+; GFX7-NEXT:    v_mad_u64_u32 v[18:19], s[10:11], v2, v9, v[18:19]
+; GFX7-NEXT:    v_mov_b32_e32 v21, v20
 ; GFX7-NEXT:    v_mov_b32_e32 v20, v11
 ; GFX7-NEXT:    v_mad_u64_u32 v[20:21], s[16:17], v0, v9, v[20:21]
-; GFX7-NEXT:    v_mad_u64_u32 v[3:4], s[12:13], v3, v8, v[18:19]
+; GFX7-NEXT:    v_addc_u32_e64 v2, s[10:11], 0, v22, s[10:11]
+; GFX7-NEXT:    v_mad_u64_u32 v[3:4], s[10:11], v3, v8, v[18:19]
 ; GFX7-NEXT:    v_mad_u64_u32 v[5:6], s[14:15], v5, v8, v[12:13]
-; GFX7-NEXT:    v_addc_u32_e64 v11, s[12:13], 0, v2, s[12:13]
+; GFX7-NEXT:    v_addc_u32_e64 v11, s[10:11], 0, v2, s[10:11]
 ; GFX7-NEXT:    v_mul_lo_u32 v9, v1, v14
 ; GFX7-NEXT:    v_cndmask_b32_e64 v12, 0, 1, s[16:17]
-; GFX7-NEXT:    v_mad_u64_u32 v[1:2], s[12:13], v1, v8, v[20:21]
-; GFX7-NEXT:    v_addc_u32_e64 v3, s[12:13], v12, v3, s[12:13]
+; GFX7-NEXT:    v_mad_u64_u32 v[1:2], s[10:11], v1, v8, v[20:21]
+; GFX7-NEXT:    v_addc_u32_e64 v3, s[10:11], v12, v3, s[10:11]
 ; GFX7-NEXT:    v_mul_lo_u32 v0, v0, v15
-; GFX7-NEXT:    v_addc_u32_e64 v4, s[12:13], v26, v4, s[12:13]
-; GFX7-NEXT:    v_addc_u32_e64 v5, s[12:13], v11, v5, s[12:13]
-; GFX7-NEXT:    v_addc_u32_e64 v6, s[12:13], v25, v6, s[12:13]
-; GFX7-NEXT:    v_addc_u32_e64 v0, s[12:13], v17, v0, s[12:13]
-; GFX7-NEXT:    v_addc_u32_e64 v0, s[12:13], v0, v9, s[14:15]
-; GFX7-NEXT:    v_addc_u32_e64 v0, s[10:11], v0, v22, s[10:11]
-; GFX7-NEXT:    v_addc_u32_e64 v0, s[8:9], v0, v24, s[8:9]
-; GFX7-NEXT:    v_addc_u32_e64 v0, s[6:7], v0, v28, s[6:7]
-; GFX7-NEXT:    v_addc_u32_e64 v0, s[4:5], v0, v27, s[4:5]
+; GFX7-NEXT:    v_addc_u32_e64 v4, s[10:11], v24, v4, s[10:11]
+; GFX7-NEXT:    v_addc_u32_e64 v5, s[10:11], v11, v5, s[10:11]
+; GFX7-NEXT:    v_addc_u32_e64 v6, s[10:11], v23, v6, s[10:11]
+; GFX7-NEXT:    v_addc_u32_e64 v0, s[10:11], v17, v0, s[10:11]
+; GFX7-NEXT:    v_addc_u32_e64 v0, s[10:11], v0, v9, s[14:15]
+; GFX7-NEXT:    v_addc_u32_e64 v0, s[10:11], v0, v28, s[12:13]
+; GFX7-NEXT:    v_addc_u32_e64 v0, s[8:9], v0, v27, s[8:9]
+; GFX7-NEXT:    v_addc_u32_e64 v0, s[6:7], v0, v26, s[6:7]
+; GFX7-NEXT:    v_addc_u32_e64 v0, s[4:5], v0, v25, s[4:5]
 ; GFX7-NEXT:    v_addc_u32_e32 v0, vcc, v0, v16, vcc
 ; GFX7-NEXT:    v_mad_u64_u32 v[7:8], s[4:5], v7, v8, v[0:1]
 ; GFX7-NEXT:    v_mov_b32_e32 v0, v10
@@ -2147,64 +2147,64 @@ define i256 @v_mul_i256(i256 %num, i256 %den) {
 ; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v0, v14, 0
 ; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v0, v12, 0
 ; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v1, v13, v[16:17]
-; GFX8-NEXT:    v_mul_lo_u32 v28, v4, v11
-; GFX8-NEXT:    v_mul_lo_u32 v27, v5, v10
+; GFX8-NEXT:    v_mad_u64_u32 v[20:21], s[6:7], v0, v10, 0
 ; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v2, v12, v[16:17]
-; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v3, v11, v[16:17]
-; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v4, v10, v[16:17]
 ; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v1, v11, v[18:19]
-; GFX8-NEXT:    v_cndmask_b32_e64 v20, 0, 1, s[4:5]
-; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v5, v9, v[16:17]
+; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v3, v11, v[16:17]
 ; GFX8-NEXT:    v_mad_u64_u32 v[18:19], vcc, v2, v10, v[18:19]
-; GFX8-NEXT:    v_addc_u32_e32 v20, vcc, 0, v20, vcc
+; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v4, v10, v[16:17]
+; GFX8-NEXT:    v_cndmask_b32_e64 v22, 0, 1, s[4:5]
+; GFX8-NEXT:    v_addc_u32_e32 v22, vcc, 0, v22, vcc
+; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[8:9], v5, v9, v[16:17]
 ; GFX8-NEXT:    v_mad_u64_u32 v[18:19], vcc, v3, v9, v[18:19]
-; GFX8-NEXT:    v_addc_u32_e32 v20, vcc, 0, v20, vcc
-; GFX8-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v0, v10, 0
+; GFX8-NEXT:    v_addc_u32_e32 v22, vcc, 0, v22, vcc
 ; GFX8-NEXT:    v_mad_u64_u32 v[18:19], vcc, v4, v8, v[18:19]
 ; GFX8-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v6, v8, v[16:17]
-; GFX8-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v1, v9, v[21:22]
-; GFX8-NEXT:    v_addc_u32_e32 v25, vcc, 0, v20, vcc
-; GFX8-NEXT:    v_mov_b32_e32 v20, v18
+; GFX8-NEXT:    v_mad_u64_u32 v[20:21], s[6:7], v1, v9, v[20:21]
+; GFX8-NEXT:    v_addc_u32_e32 v23, vcc, 0, v22, vcc
+; GFX8-NEXT:    v_mov_b32_e32 v22, v18
 ; GFX8-NEXT:    v_mov_b32_e32 v18, v19
 ; GFX8-NEXT:    v_mov_b32_e32 v19, v16
 ; GFX8-NEXT:    v_mad_u64_u32 v[18:19], vcc, v0, v13, v[18:19]
 ; GFX8-NEXT:    v_mul_lo_u32 v16, v6, v9
-; GFX8-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
-; GFX8-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v2, v8, v[21:22]
-; GFX8-NEXT:    v_addc_u32_e64 v26, s[4:5], 0, v6, s[4:5]
-; GFX8-NEXT:    v_mad_u64_u32 v[23:24], s[4:5], v1, v12, v[18:19]
-; GFX8-NEXT:    v_mov_b32_e32 v19, v22
-; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[12:13], v0, v11, v[19:20]
-; GFX8-NEXT:    v_mad_u64_u32 v[22:23], s[6:7], v2, v11, v[23:24]
-; GFX8-NEXT:    v_mul_lo_u32 v24, v3, v12
-; GFX8-NEXT:    v_mad_u64_u32 v[11:12], s[8:9], v3, v10, v[22:23]
-; GFX8-NEXT:    v_mul_lo_u32 v22, v2, v13
-; GFX8-NEXT:    v_mad_u64_u32 v[12:13], s[10:11], v4, v9, v[11:12]
-; GFX8-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s[12:13]
-; GFX8-NEXT:    v_mad_u64_u32 v[10:11], s[12:13], v1, v10, v[18:19]
-; GFX8-NEXT:    v_addc_u32_e64 v4, s[12:13], 0, v4, s[12:13]
-; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[12:13], v2, v9, v[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[6:7]
+; GFX8-NEXT:    v_mad_u64_u32 v[20:21], s[4:5], v2, v8, v[20:21]
+; GFX8-NEXT:    v_addc_u32_e64 v24, s[4:5], 0, v6, s[4:5]
+; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v1, v12, v[18:19]
+; GFX8-NEXT:    v_mad_u64_u32 v[21:22], s[10:11], v0, v11, v[21:22]
+; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[6:7], v2, v11, v[18:19]
+; GFX8-NEXT:    v_mul_lo_u32 v26, v4, v11
+; GFX8-NEXT:    v_mul_lo_u32 v27, v3, v12
+; GFX8-NEXT:    v_mad_u64_u32 v[11:12], s[8:9], v3, v10, v[18:19]
+; GFX8-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[10:11]
+; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[10:11], v1, v10, v[21:22]
+; GFX8-NEXT:    v_mul_lo_u32 v25, v5, v10
+; GFX8-NEXT:    v_mul_lo_u32 v28, v2, v13
+; GFX8-NEXT:    v_mad_u64_u32 v[12:13], s[12:13], v4, v9, v[11:12]
 ; GFX8-NEXT:    v_mad_u64_u32 v[10:11], s[14:15], v0, v8, 0
-; GFX8-NEXT:    v_addc_u32_e64 v2, s[12:13], 0, v4, s[12:13]
+; GFX8-NEXT:    v_addc_u32_e64 v22, s[10:11], 0, v6, s[10:11]
+; GFX8-NEXT:    v_mad_u64_u32 v[18:19], s[10:11], v2, v9, v[18:19]
+; GFX8-NEXT:    v_mov_b32_e32 v21, v20
 ; GFX8-NEXT:    v_mov_b32_e32 v20, v11
 ; GFX8-NEXT:    v_mad_u64_u32 v[20:21], s[16:17], v0, v9, v[20:21]
-; GFX8-NEXT:    v_mad_u64_u32 v[3:4], s[12:13], v3, v8, v[18:19]
+; GFX8-NEXT:    v_addc_u32_e64 v2, s[10:11], 0, v22, s[10:11]
+; GFX8-NEXT:    v_mad_u64_u32 v[3:4], s[10:11], v3, v8, v[18:19]
 ; GFX8-NEXT:    v_mad_u64_u32 v[5:6], s[14:15], v5, v8, v[12:13]
-; GFX8-NEXT:    v_addc_u32_e64 v11, s[12:13], 0, v2, s[12:13]
+; GFX8-NEXT:    v_addc_u32_e64 v11, s[10:11], 0, v2, s[10:11]
 ; GFX8-NEXT:    v_mul_lo_u32 v9, v1, v14
 ; GFX8-NEXT:    v_cndmask_b32_e64 v12, 0, 1, s[16:17]
-; GFX8-NEXT:    v_mad_u64_u32 v[1:2], s[12:13], v1, v8, v[20:21]
-; GFX8-NEXT:    v_addc_u32_e64 v3, s[12:13], v12, v3, s[12:13]
+; GFX8-NEXT:    v_mad_u64_u32 v[1:2], s[10:11], v1, v8, v[20:21]
+; GFX8-NEXT:    v_addc_u32_e64 v3, s[10:11], v12, v3, s[10:11]
 ; GFX8-NEXT:    v_mul_lo_u32 v0, v0, v15
-; GFX8-NEXT:    v_addc_u32_e64 v4, s[12:13], v26, v4, s[12:13]
-; GFX8-NEXT:    v_addc_u32_e64 v5, s[12:13], v11, v5, s[12:13]
-; GFX8-NEXT:    v_addc_u32_e64 v6, s[12:13], v25, v6, s[12:13]
-; GFX8-NEXT:    v_addc_u32_e64 v0, s[12:13], v17, v0, s[12:13]
-; GFX8-NEXT:    v_addc_u32_e64 v0, s[12:13], v0, v9, s[14:15]
-; GFX8-NEXT:    v_addc_u32_e64 v0, s[10:11], v0, v22, s[10:11]
-; GFX8-NEXT:    v_addc_u32_e64 v0, s[8:9], v0, v24, s[8:9]
-; GFX8-NEXT:    v_addc_u32_e64 v0, s[6:7], v0, v28, s[6:7]
-; GFX8-NEXT:    v_addc_u32_e64 v0, s[4:5], v0, v27, s[4:5]
+; GFX8-NEXT:    v_addc_u32_e64 v4, s[10:11], v24, v4, s[10:11]
+; GFX8-NEXT:    v_addc_u32_e64 v5, s[10:11], v11, v5, s[10:11]
+; GFX8-NEXT:    v_addc_u32_e64 v6, s[10:11], v23, v6, s[10:11]
+; GFX8-NEXT:    v_addc_u32_e64 v0, s[10:11], v17, v0, s[10:11]
+; GFX8-NEXT:    v_addc_u32_e64 v0, s[10:11], v0, v9, s[14:15]
+; GFX8-NEXT:    v_addc_u32_e64 v0, s[10:11], v0, v28, s[12:13]
+; GFX8-NEXT:    v_addc_u32_e64 v0, s[8:9], v0, v27, s[8:9]
+; GFX8-NEXT:    v_addc_u32_e64 v0, s[6:7], v0, v26, s[6:7]
+; GFX8-NEXT:    v_addc_u32_e64 v0, s[4:5], v0, v25, s[4:5]
 ; GFX8-NEXT:    v_addc_u32_e32 v0, vcc, v0, v16, vcc
 ; GFX8-NEXT:    v_mad_u64_u32 v[7:8], s[4:5], v7, v8, v[0:1]
 ; GFX8-NEXT:    v_mov_b32_e32 v0, v10
@@ -2216,64 +2216,64 @@ define i256 @v_mul_i256(i256 %num, i256 %den) {
 ; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v0, v14, 0
 ; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v0, v12, 0
 ; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v1, v13, v[16:17]
-; GFX9-NEXT:    v_mul_lo_u32 v28, v4, v11
-; GFX9-NEXT:    v_mul_lo_u32 v27, v5, v10
+; GFX9-NEXT:    v_mad_u64_u32 v[20:21], s[6:7], v0, v10, 0
 ; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v2, v12, v[16:17]
-; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v3, v11, v[16:17]
-; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v4, v10, v[16:17]
 ; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v1, v11, v[18:19]
-; GFX9-NEXT:    v_cndmask_b32_e64 v20, 0, 1, s[4:5]
-; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v5, v9, v[16:17]
+; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v3, v11, v[16:17]
 ; GFX9-NEXT:    v_mad_u64_u32 v[18:19], vcc, v2, v10, v[18:19]
-; GFX9-NEXT:    v_addc_co_u32_e32 v20, vcc, 0, v20, vcc
+; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[6:7], v4, v10, v[16:17]
+; GFX9-NEXT:    v_cndmask_b32_e64 v22, 0, 1, s[4:5]
+; GFX9-NEXT:    v_addc_co_u32_e32 v22, vcc, 0, v22, vcc
+; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[8:9], v5, v9, v[16:17]
 ; GFX9-NEXT:    v_mad_u64_u32 v[18:19], vcc, v3, v9, v[18:19]
-; GFX9-NEXT:    v_addc_co_u32_e32 v20, vcc, 0, v20, vcc
-; GFX9-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v0, v10, 0
+; GFX9-NEXT:    v_addc_co_u32_e32 v22, vcc, 0, v22, vcc
 ; GFX9-NEXT:    v_mad_u64_u32 v[18:19], vcc, v4, v8, v[18:19]
 ; GFX9-NEXT:    v_mad_u64_u32 v[16:17], s[4:5], v6, v8, v[16:17]
-; GFX9-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v1, v9, v[21:22]
-; GFX9-NEXT:    v_addc_co_u32_e32 v25, vcc, 0, v20, vcc
-; GFX9-NEXT:    v_mov_b32_e32 v20, v18
+; GFX9-NEXT:    v_mad_u64_u32 v[20:21], s[6:7], v1, v9, v[20:21]
+; GFX9-NEXT:    v_addc_co_u32_e32 v23, vcc, 0, v22, vcc
+; GFX9-NEXT:    v_mov_b32_e32 v22, v18
 ; GFX9-NEXT:    v_mov_b32_e32 v18, v19
 ; GFX9-NEXT:    v_mov_b32_e32 v19, v16
 ; GFX9-NEXT:    v_mad_u64_u32 v[18:19], vcc, v0, v13, v[18:19]
 ; GFX9-NEXT:    v_mul_lo_u32 v16, v6, v9
-; GFX9-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
-; GFX9-NEXT:    v_mad_u64_u32 v[21:22], s[4:5], v2, v8, v[21:22]
-; GFX9-NEXT:    v_addc_co_u32_e64 v26, s[4:5], 0, v6, s[4:5]
-; GFX9-NEXT:    v_mad_u64_u32 v[23:24], s[4:5], v1, v12, v[18:19]
-; GFX9-NEXT:    v_mov_b32_e32 v19, v22
-; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[12:13], v0, v11, v[19:20]
-; GFX9-NEXT:    v_mad_u64_u32 v[22:23], s[6:7], v2, v11, v[23:24]
-; GFX9-NEXT:    v_mul_lo_u32 v24, v3, v12
-; GFX9-NEXT:    v_mad_u64_u32 v[11:12], s[8:9], v3, v10, v[22:23]
-; GFX9-NEXT:    v_mul_lo_u32 v22, v2, v13
-; GFX9-NEXT:    v_mad_u64_u32 v[12:13], s[10:11], v4, v9, v[11:12]
-; GFX9-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s[12:13]
-; GFX9-NEXT:    v_mad_u64_u32 v[10:11], s[12:13], v1, v10, v[18:19]
-; GFX9-NEXT:    v_addc_co_u32_e64 v4, s[12:13], 0, v4, s[12:13]
-; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[12:13], v2, v9, v[10:11]
+; GFX9-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[6:7]
+; GFX9-NEXT:    v_mad_u64_u32 v[20:21], s[4:5], v2, v8, v[20:21]
+; GFX9-NEXT:    v_addc_co_u32_e64 v24, s[4:5], 0, v6, s[4:5]
+; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[4:5], v1, v12, v[18:19]
+; GFX9-NEXT:    v_mad_u64_u32 v[21:22], s[10:11], v0, v11, v[21:22]
+; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[6:7], v2, v11, v[18:19]
+; GFX9-NEXT:    v_mul_lo_u32 v26, v4, v11
+; GFX9-NEXT:    v_mul_lo_u32 v27, v3, v12
+; GFX9-NEXT:    v_mad_u64_u32 v[11:12], s[8:9], v3, v10, v[18:19]
+; GFX9-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[10:11]
+; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[10:11], v1, v10, v[21:22]
+; GFX9-NEXT:    v_mul_lo_u32 v25, v5, v10
+; GFX9-NEXT:    v_mul_lo_u32 v28, v2, v13
+; GFX9-NEXT:    v_mad_u64_u32 v[12:13], s[12:13], v4, v9, v[11:12]
 ; GFX9-NEXT:    v_mad_u64_u32 v[10:11], s[14:15], v0, v8, 0
-; GFX9-NEXT:    v_addc_co_u32_e64 v2, s[12:13], 0, v4, s[12:13]
+; GFX9-NEXT:    v_addc_co_u32_e64 v22, s[10:11], 0, v6, s[10:11]
+; GFX9-NEXT:    v_mad_u64_u32 v[18:19], s[10:11], v2, v9, v[18:19]
+; GFX9-NEXT:    v_mov_b32_e32 v21, v20
 ; GFX9-NEXT:    v_mov_b32_e32 v20, v11
 ; GFX9-NEXT:    v_mad_u64_u32 v[20:21], s[16:17], v0, v9, v[20:21]
-; GFX9-NEXT:    v_mad_u64_u32 v[3:4], s[12:13], v3, v8, v[18:19]
+; GFX9-NEXT:    v_addc_co_u32_e64 v2, s[10:11], 0, v22, s[10:11]
+; GFX9-NEXT:    v_mad_u64_u32 v[3:4], s[10:11], v3, v8, v[18:19]
 ; GFX9-NEXT:    v_mad_u64_u32 v[5:6], s[14:15], v5, v8, v[12:13]
-; GFX9-NEXT:    v_addc_co_u32_e64 v11, s[12:13], 0, v2, s[12:13]
+; GFX9-NEXT:    v_addc_co_u32_e64 v11, s[10:11], 0, v2, s[10:11]
 ; GFX9-NEXT:    v_mul_lo_u32 v9, v1, v14
 ; GFX9-NEXT:    v_cndmask_b32_e64 v12, 0, 1, s[16:17]
-; GFX9-NEXT:    v_mad_u64_u32 v[1:2], s[12:13], v1, v8, v[20:21]
-; GFX9-NEXT:    v_addc_co_u32_e64 v3, s[12:13], v12, v3, s[12:13]
+; GFX9-NEXT:    v_mad_u64_u32 v[1:2], s[10:11], v1, v8, v[20:21]
+; GFX9-NEXT:    v_addc_co_u32_e64 v3, s[10:11], v12, v3, s[10:11]
 ; GFX9-NEXT:    v_mul_lo_u32 v0, v0, v15
-; GFX9-NEXT:    v_addc_co_u32_e64 v4, s[12:13], v26, v4, s[12:13]
-; GFX9-NEXT:    v_addc_co_u32_e64 v5, s[12:13], v11, v5, s[12:13]
-; GFX9-NEXT:    v_addc_co_u32_e64 v6, s[12:13], v25, v6, s[12:13]
-; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[12:13], v17, v0, s[12:13]
-; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[12:13], v0, v9, s[14:15]
-; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[10:11], v0, v22, s[10:11]
-; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[8:9], v0, v24, s[8:9]
-; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[6:7], v0, v28, s[6:7]
-; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[4:5], v0, v27, s[4:5]
+; GFX9-NEXT:    v_addc_co_u32_e64 v4, s[10:11], v24, v4, s[10:11]
+; GFX9-NEXT:    v_addc_co_u32_e64 v5, s[10:11], v11, v5, s[10:11]
+; GFX9-NEXT:    v_addc_co_u32_e64 v6, s[10:11], v23, v6, s[10:11]
+; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[10:11], v17, v0, s[10:11]
+; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[10:11], v0, v9, s[14:15]
+; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[10:11], v0, v28, s[12:13]
+; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[8:9], v0, v27, s[8:9]
+; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[6:7], v0, v26, s[6:7]
+; GFX9-NEXT:    v_addc_co_u32_e64 v0, s[4:5], v0, v25, s[4:5]
 ; GFX9-NEXT:    v_addc_co_u32_e32 v0, vcc, v0, v16, vcc
 ; GFX9-NEXT:    v_mad_u64_u32 v[7:8], s[4:5], v7, v8, v[0:1]
 ; GFX9-NEXT:    v_mov_b32_e32 v0, v10
@@ -2476,12 +2476,11 @@ define i256 @v_mul_i256(i256 %num, i256 %den) {
 ; GFX12-NEXT:    s_wait_alu 0xf1ff
 ; GFX12-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s2
 ; GFX12-NEXT:    v_mul_lo_u32 v24, v2, v13
-; GFX12-NEXT:    v_mov_b32_e32 v13, v1
 ; GFX12-NEXT:    v_mad_co_u64_u32 v[11:12], s2, v17, v10, v[14:15]
 ; GFX12-NEXT:    v_mad_co_u64_u32 v[18:19], s3, v3, v10, v[18:19]
 ; GFX12-NEXT:    s_wait_alu 0xf1ff
 ; GFX12-NEXT:    v_add_co_ci_u32_e64 v6, null, 0, v6, s2
-; GFX12-NEXT:    v_mov_b32_e32 v14, v21
+; GFX12-NEXT:    v_dual_mov_b32 v13, v1 :: v_dual_mov_b32 v14, v21
 ; GFX12-NEXT:    v_mad_co_u64_u32 v[1:2], s2, v2, v9, v[11:12]
 ; GFX12-NEXT:    s_wait_alu 0xf1ff
 ; GFX12-NEXT:    v_add_co_ci_u32_e64 v6, null, 0, v6, s2
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
index 723ad5646c0a3..a7ea52344cfef 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
@@ -5037,17 +5037,17 @@ define amdgpu_ps <2 x i64> @s_saddsat_v2i64(<2 x i64> inreg %lhs, <2 x i64> inre
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s1, s[4:5], 0
 ; GFX10-NEXT:    s_ashr_i32 s4, s9, 31
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s9
-; GFX10-NEXT:    s_add_i32 s5, s4, 0x80000000
-; GFX10-NEXT:    s_xor_b32 s8, s1, s0
+; GFX10-NEXT:    s_add_i32 s8, s4, 0x80000000
+; GFX10-NEXT:    s_xor_b32 s5, s1, s0
 ; GFX10-NEXT:    s_add_u32 s0, s2, s6
 ; GFX10-NEXT:    s_addc_u32 s1, s3, s7
 ; GFX10-NEXT:    v_mov_b32_e32 v2, s0
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s2, s[0:1], s[2:3]
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s3, s[6:7], 0
 ; GFX10-NEXT:    v_mov_b32_e32 v3, s1
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s8
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s5
 ; GFX10-NEXT:    s_ashr_i32 s4, s1, 31
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s8
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s5
 ; GFX10-NEXT:    s_add_i32 s0, s4, 0x80000000
 ; GFX10-NEXT:    s_xor_b32 s1, s3, s2
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s4, s1
@@ -5066,16 +5066,16 @@ define amdgpu_ps <2 x i64> @s_saddsat_v2i64(<2 x i64> inreg %lhs, <2 x i64> inre
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s0, s[8:9], s[0:1]
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s1, s[4:5], 0
 ; GFX11-NEXT:    s_ashr_i32 s4, s9, 31
-; GFX11-NEXT:    s_add_i32 s5, s4, 0x80000000
-; GFX11-NEXT:    s_xor_b32 s8, s1, s0
+; GFX11-NEXT:    s_add_i32 s8, s4, 0x80000000
+; GFX11-NEXT:    s_xor_b32 s5, s1, s0
 ; GFX11-NEXT:    s_add_u32 s0, s2, s6
 ; GFX11-NEXT:    s_addc_u32 s1, s3, s7
 ; GFX11-NEXT:    v_dual_mov_b32 v2, s0 :: v_dual_mov_b32 v3, s1
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s2, s[0:1], s[2:3]
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s3, s[6:7], 0
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s8
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s5
 ; GFX11-NEXT:    s_ashr_i32 s4, s1, 31
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s8
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s5
 ; GFX11-NEXT:    s_add_i32 s0, s4, 0x80000000
 ; GFX11-NEXT:    s_xor_b32 s1, s3, s2
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s4, s1
@@ -5450,8 +5450,8 @@ define amdgpu_ps <4 x float> @saddsat_i128_sv(i128 inreg %lhs, i128 %rhs) {
 ; GFX11-NEXT:    v_add_nc_u32_e32 v6, 0x80000000, v3
 ; GFX11-NEXT:    v_and_b32_e32 v2, 1, v2
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v2
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e32 v0, v0, v3, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc_lo
 ; GFX11-NEXT:    v_dual_cndmask_b32 v2, v4, v3 :: v_dual_cndmask_b32 v3, v5, v6
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.sadd.sat.i128(i128 %lhs, i128 %rhs)
@@ -5611,8 +5611,8 @@ define amdgpu_ps <4 x float> @saddsat_i128_vs(i128 %lhs, i128 inreg %rhs) {
 ; GFX11-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX11-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc_lo
 ; GFX11-NEXT:    v_dual_cndmask_b32 v0, v4, v2 :: v_dual_cndmask_b32 v3, v7, v3
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e32 v2, v6, v2, vcc_lo
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.sadd.sat.i128(i128 %lhs, i128 %rhs)
@@ -6145,8 +6145,8 @@ define amdgpu_ps <2 x i128> @s_saddsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX10-NEXT:    s_addc_u32 s16, s2, s10
 ; GFX10-NEXT:    v_cmp_lt_u64_e64 s0, s[8:9], s[0:1]
 ; GFX10-NEXT:    s_addc_u32 s17, s3, s11
-; GFX10-NEXT:    v_mov_b32_e32 v4, s9
 ; GFX10-NEXT:    s_cmp_eq_u64 s[16:17], s[2:3]
+; GFX10-NEXT:    v_mov_b32_e32 v4, s17
 ; GFX10-NEXT:    s_cselect_b32 s18, 1, 0
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s0
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s0, s[16:17], s[2:3]
@@ -6176,7 +6176,7 @@ define amdgpu_ps <2 x i128> @s_saddsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s4, s[2:3], s[6:7]
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s6, s[14:15], 0
 ; GFX10-NEXT:    v_and_b32_e32 v0, 1, v0
-; GFX10-NEXT:    v_mov_b32_e32 v6, s1
+; GFX10-NEXT:    v_mov_b32_e32 v6, s2
 ; GFX10-NEXT:    v_mov_b32_e32 v7, s3
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s4
 ; GFX10-NEXT:    s_and_b32 s4, 1, s12
@@ -6188,31 +6188,31 @@ define amdgpu_ps <2 x i128> @s_saddsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 s4, 0, s5
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc_lo
 ; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, s16
+; GFX10-NEXT:    v_mov_b32_e32 v0, s9
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, v3, 0, s4
 ; GFX10-NEXT:    v_mov_b32_e32 v3, s8
 ; GFX10-NEXT:    s_ashr_i32 s4, s3, 31
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, s10, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, s11, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s10, vcc_lo
 ; GFX10-NEXT:    v_xor_b32_e32 v1, v2, v1
-; GFX10-NEXT:    v_mov_b32_e32 v2, s17
+; GFX10-NEXT:    v_mov_b32_e32 v2, s16
 ; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, s10, vcc_lo
 ; GFX10-NEXT:    s_add_i32 s0, s4, 0x80000000
-; GFX10-NEXT:    v_readfirstlane_b32 s1, v4
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v4
 ; GFX10-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s11, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s10, vcc_lo
 ; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v1
-; GFX10-NEXT:    v_mov_b32_e32 v1, s2
-; GFX10-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX10-NEXT:    v_readfirstlane_b32 s3, v2
+; GFX10-NEXT:    v_mov_b32_e32 v1, s1
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
 ; GFX10-NEXT:    v_cndmask_b32_e64 v5, v5, s4, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, s0, vcc_lo
 ; GFX10-NEXT:    v_readfirstlane_b32 s0, v3
 ; GFX10-NEXT:    v_readfirstlane_b32 s4, v5
-; GFX10-NEXT:    v_readfirstlane_b32 s5, v6
-; GFX10-NEXT:    v_readfirstlane_b32 s6, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s5, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s6, v6
 ; GFX10-NEXT:    v_readfirstlane_b32 s7, v7
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
@@ -6247,12 +6247,14 @@ define amdgpu_ps <2 x i128> @s_saddsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX11-NEXT:    s_addc_u32 s3, s7, s15
 ; GFX11-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX11-NEXT:    s_cmp_eq_u64 s[2:3], s[6:7]
-; GFX11-NEXT:    v_dual_mov_b32 v6, s1 :: v_dual_mov_b32 v7, s3
+; GFX11-NEXT:    v_mov_b32_e32 v4, s17
+; GFX11-NEXT:    s_cselect_b32 s12, 1, 0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s4
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s4, s[2:3], s[6:7]
-; GFX11-NEXT:    s_cselect_b32 s12, 1, 0
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s6, s[14:15], 0
-; GFX11-NEXT:    v_dual_mov_b32 v5, s0 :: v_dual_and_b32 v0, 1, v0
+; GFX11-NEXT:    v_and_b32_e32 v0, 1, v0
+; GFX11-NEXT:    v_dual_mov_b32 v6, s2 :: v_dual_mov_b32 v7, s3
+; GFX11-NEXT:    v_mov_b32_e32 v5, s0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s4
 ; GFX11-NEXT:    s_and_b32 s4, 1, s12
 ; GFX11-NEXT:    s_cmp_eq_u64 s[14:15], 0
@@ -6265,30 +6267,29 @@ define amdgpu_ps <2 x i128> @s_saddsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v3, 0, s4
 ; GFX11-NEXT:    v_mov_b32_e32 v3, s8
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX11-NEXT:    v_mov_b32_e32 v0, s16
 ; GFX11-NEXT:    s_ashr_i32 s4, s3, 31
 ; GFX11-NEXT:    v_xor_b32_e32 v1, v2, v1
-; GFX11-NEXT:    v_mov_b32_e32 v4, s9
-; GFX11-NEXT:    v_mov_b32_e32 v2, s17
+; GFX11-NEXT:    v_mov_b32_e32 v0, s9
+; GFX11-NEXT:    v_mov_b32_e32 v2, s16
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, s10, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s10, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v4, v4, s11, vcc_lo
 ; GFX11-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX11-NEXT:    v_cndmask_b32_e64 v4, v4, s10, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s11, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s10, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s10, vcc_lo
 ; GFX11-NEXT:    s_add_i32 s0, s4, 0x80000000
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v4
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v1
-; GFX11-NEXT:    v_mov_b32_e32 v1, s2
-; GFX11-NEXT:    v_readfirstlane_b32 s1, v4
-; GFX11-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX11-NEXT:    v_readfirstlane_b32 s3, v2
+; GFX11-NEXT:    v_mov_b32_e32 v1, s1
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
 ; GFX11-NEXT:    v_cndmask_b32_e64 v5, v5, s4, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v7, v7, s0, vcc_lo
 ; GFX11-NEXT:    v_readfirstlane_b32 s0, v3
 ; GFX11-NEXT:    v_readfirstlane_b32 s4, v5
-; GFX11-NEXT:    v_readfirstlane_b32 s5, v6
-; GFX11-NEXT:    v_readfirstlane_b32 s6, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s5, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s6, v6
 ; GFX11-NEXT:    v_readfirstlane_b32 s7, v7
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call <2 x i128> @llvm.sadd.sat.v2i128(<2 x i128> %lhs, <2 x i128> %rhs)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll
index b59f85b2dfa38..02f8d0bf3c3df 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll
@@ -813,30 +813,30 @@ define amdgpu_kernel void @sdivrem_v2i32(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, s0, v2
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v3, s7, v3
+; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s4, v2
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v6, s4, v2
-; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s3, v3
-; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s4, v2
+; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s3, v3
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s3, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, v6, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v3, v7, vcc_lo
-; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v2, v6, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v5, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v4, 1, v0
-; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s4, v2
-; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s3, v3
+; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s4, v2
+; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
+; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s3, v3
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v6, s4, v2
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s3, v3
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, v6, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v3, v7, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
+; GFX10-NEXT:    v_mov_b32_e32 v4, 0
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v5, s0
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v2, v6, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s0
 ; GFX10-NEXT:    s_xor_b32 s0, s6, s2
 ; GFX10-NEXT:    v_xor_b32_e32 v0, s1, v0
 ; GFX10-NEXT:    v_xor_b32_e32 v1, s0, v1
 ; GFX10-NEXT:    v_xor_b32_e32 v2, s5, v2
 ; GFX10-NEXT:    v_xor_b32_e32 v3, s6, v3
-; GFX10-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v0, s1, v0
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v1, s0, v1
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v2, s5, v2
@@ -2568,8 +2568,8 @@ define amdgpu_kernel void @sdivrem_v2i8(ptr addrspace(1) %out0, ptr addrspace(1)
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, s4, v2
 ; GFX10-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x0
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v3, s0, v3
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s1, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s1, v2
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s1, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s3, v3
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s3, v3
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
@@ -2581,13 +2581,13 @@ define amdgpu_kernel void @sdivrem_v2i8(ptr addrspace(1) %out0, ptr addrspace(1)
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s1, v2
 ; GFX10-NEXT:    v_add_nc_u32_e32 v6, 1, v1
 ; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s3, v3
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s3, v3
+; GFX10-NEXT:    s_xor_b32 s1, s11, s2
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v4, s3, v3
 ; GFX10-NEXT:    v_cndmask_b32_e32 v2, v2, v5, vcc_lo
-; GFX10-NEXT:    s_xor_b32 s1, s11, s2
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s0
 ; GFX10-NEXT:    v_xor_b32_e32 v0, s1, v0
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v4, s0
 ; GFX10-NEXT:    v_xor_b32_e32 v2, s11, v2
 ; GFX10-NEXT:    s_xor_b32 s0, s12, s10
 ; GFX10-NEXT:    v_mov_b32_e32 v4, 0xff
@@ -2981,8 +2981,8 @@ define amdgpu_kernel void @sdivrem_v2i16(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, s4, v2
 ; GFX10-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x0
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v3, s0, v3
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s2, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s2, v2
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s2, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s1, v3
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v4, s1, v3
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll
index d6eb4b3477adb..a546b24cc58f9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll
@@ -5043,17 +5043,17 @@ define amdgpu_ps <2 x i64> @s_ssubsat_v2i64(<2 x i64> inreg %lhs, <2 x i64> inre
 ; GFX10-NEXT:    v_cmp_gt_i64_e64 s1, s[4:5], 0
 ; GFX10-NEXT:    s_ashr_i32 s4, s9, 31
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s9
-; GFX10-NEXT:    s_add_i32 s5, s4, 0x80000000
-; GFX10-NEXT:    s_xor_b32 s8, s1, s0
+; GFX10-NEXT:    s_add_i32 s8, s4, 0x80000000
+; GFX10-NEXT:    s_xor_b32 s5, s1, s0
 ; GFX10-NEXT:    s_sub_u32 s0, s2, s6
 ; GFX10-NEXT:    s_subb_u32 s1, s3, s7
 ; GFX10-NEXT:    v_mov_b32_e32 v2, s0
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s2, s[0:1], s[2:3]
 ; GFX10-NEXT:    v_cmp_gt_i64_e64 s3, s[6:7], 0
 ; GFX10-NEXT:    v_mov_b32_e32 v3, s1
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s8
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s5
 ; GFX10-NEXT:    s_ashr_i32 s4, s1, 31
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s8
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s5
 ; GFX10-NEXT:    s_add_i32 s0, s4, 0x80000000
 ; GFX10-NEXT:    s_xor_b32 s1, s3, s2
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s4, s1
@@ -5072,16 +5072,16 @@ define amdgpu_ps <2 x i64> @s_ssubsat_v2i64(<2 x i64> inreg %lhs, <2 x i64> inre
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s0, s[8:9], s[0:1]
 ; GFX11-NEXT:    v_cmp_gt_i64_e64 s1, s[4:5], 0
 ; GFX11-NEXT:    s_ashr_i32 s4, s9, 31
-; GFX11-NEXT:    s_add_i32 s5, s4, 0x80000000
-; GFX11-NEXT:    s_xor_b32 s8, s1, s0
+; GFX11-NEXT:    s_add_i32 s8, s4, 0x80000000
+; GFX11-NEXT:    s_xor_b32 s5, s1, s0
 ; GFX11-NEXT:    s_sub_u32 s0, s2, s6
 ; GFX11-NEXT:    s_subb_u32 s1, s3, s7
 ; GFX11-NEXT:    v_dual_mov_b32 v2, s0 :: v_dual_mov_b32 v3, s1
 ; GFX11-NEXT:    v_cmp_lt_i64_e64 s2, s[0:1], s[2:3]
 ; GFX11-NEXT:    v_cmp_gt_i64_e64 s3, s[6:7], 0
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s8
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s4, s5
 ; GFX11-NEXT:    s_ashr_i32 s4, s1, 31
-; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s5, s8
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s8, s5
 ; GFX11-NEXT:    s_add_i32 s0, s4, 0x80000000
 ; GFX11-NEXT:    s_xor_b32 s1, s3, s2
 ; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s4, s1
@@ -5263,20 +5263,20 @@ define amdgpu_ps i128 @s_ssubsat_i128(i128 inreg %lhs, i128 inreg %rhs) {
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s1
 ; GFX10-NEXT:    s_add_i32 s1, s0, 0x80000000
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, v3, v2, vcc_lo
-; GFX10-NEXT:    v_mov_b32_e32 v2, s9
+; GFX10-NEXT:    v_mov_b32_e32 v2, s10
 ; GFX10-NEXT:    v_mov_b32_e32 v3, s11
 ; GFX10-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s8
 ; GFX10-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, s10
+; GFX10-NEXT:    v_mov_b32_e32 v0, s9
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s0, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s0, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s0, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s0, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, s1, vcc_lo
 ; GFX10-NEXT:    v_readfirstlane_b32 s0, v1
-; GFX10-NEXT:    v_readfirstlane_b32 s1, v2
-; GFX10-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
 ; GFX10-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
@@ -5305,19 +5305,19 @@ define amdgpu_ps i128 @s_ssubsat_i128(i128 inreg %lhs, i128 inreg %rhs) {
 ; GFX11-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc_lo
 ; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s1
 ; GFX11-NEXT:    s_add_i32 s1, s0, 0x80000000
-; GFX11-NEXT:    v_dual_cndmask_b32 v1, v3, v2 :: v_dual_mov_b32 v2, s9
+; GFX11-NEXT:    v_dual_cndmask_b32 v1, v3, v2 :: v_dual_mov_b32 v2, s10
 ; GFX11-NEXT:    v_mov_b32_e32 v3, s11
 ; GFX11-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX11-NEXT:    v_dual_mov_b32 v1, s8 :: v_dual_and_b32 v0, 1, v0
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX11-NEXT:    v_mov_b32_e32 v0, s10
+; GFX11-NEXT:    v_mov_b32_e32 v0, s9
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s0, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s0, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s0, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s0, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, s1, vcc_lo
 ; GFX11-NEXT:    v_readfirstlane_b32 s0, v1
-; GFX11-NEXT:    v_readfirstlane_b32 s1, v2
-; GFX11-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
 ; GFX11-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.ssub.sat.i128(i128 %lhs, i128 %rhs)
@@ -5474,8 +5474,8 @@ define amdgpu_ps <4 x float> @ssubsat_i128_sv(i128 inreg %lhs, i128 %rhs) {
 ; GFX11-NEXT:    v_xor_b32_e32 v0, v0, v8
 ; GFX11-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e32 v0, v4, v2, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc_lo
 ; GFX11-NEXT:    v_dual_cndmask_b32 v2, v6, v2 :: v_dual_cndmask_b32 v3, v7, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.ssub.sat.i128(i128 %lhs, i128 %rhs)
@@ -5646,8 +5646,8 @@ define amdgpu_ps <4 x float> @ssubsat_i128_vs(i128 %lhs, i128 inreg %rhs) {
 ; GFX11-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX11-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e32 v0, v4, v2, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc_lo
 ; GFX11-NEXT:    v_dual_cndmask_b32 v2, v6, v2 :: v_dual_cndmask_b32 v3, v7, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i128 @llvm.ssub.sat.i128(i128 %lhs, i128 %rhs)
@@ -6237,7 +6237,7 @@ define amdgpu_ps <2 x i128> @s_ssubsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX10-NEXT:    s_subb_u32 s3, s7, s15
 ; GFX10-NEXT:    v_mov_b32_e32 v5, s0
 ; GFX10-NEXT:    s_cmp_eq_u64 s[2:3], s[6:7]
-; GFX10-NEXT:    v_mov_b32_e32 v6, s1
+; GFX10-NEXT:    v_mov_b32_e32 v6, s2
 ; GFX10-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s4
 ; GFX10-NEXT:    v_cmp_lt_i64_e64 s4, s[2:3], s[6:7]
@@ -6260,29 +6260,29 @@ define amdgpu_ps <2 x i128> @s_ssubsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
 ; GFX10-NEXT:    v_cndmask_b32_e32 v2, v4, v3, vcc_lo
 ; GFX10-NEXT:    v_mov_b32_e32 v3, s18
-; GFX10-NEXT:    v_mov_b32_e32 v4, s19
 ; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, s16
+; GFX10-NEXT:    v_mov_b32_e32 v0, s19
+; GFX10-NEXT:    v_mov_b32_e32 v4, s17
 ; GFX10-NEXT:    v_xor_b32_e32 v1, v2, v1
-; GFX10-NEXT:    v_mov_b32_e32 v2, s17
+; GFX10-NEXT:    v_mov_b32_e32 v2, s16
 ; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, s8, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, s8, vcc_lo
-; GFX10-NEXT:    v_and_b32_e32 v1, 1, v1
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, s8, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s9, vcc_lo
-; GFX10-NEXT:    v_readfirstlane_b32 s1, v4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, s9, vcc_lo
+; GFX10-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, s8, vcc_lo
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v4
 ; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v1
-; GFX10-NEXT:    v_mov_b32_e32 v1, s2
-; GFX10-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX10-NEXT:    v_readfirstlane_b32 s3, v2
+; GFX10-NEXT:    v_mov_b32_e32 v1, s1
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
 ; GFX10-NEXT:    v_cndmask_b32_e64 v5, v5, s4, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, s0, vcc_lo
 ; GFX10-NEXT:    v_readfirstlane_b32 s0, v3
 ; GFX10-NEXT:    v_readfirstlane_b32 s4, v5
-; GFX10-NEXT:    v_readfirstlane_b32 s5, v6
-; GFX10-NEXT:    v_readfirstlane_b32 s6, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s5, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s6, v6
 ; GFX10-NEXT:    v_readfirstlane_b32 s7, v7
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
@@ -6317,7 +6317,7 @@ define amdgpu_ps <2 x i128> @s_ssubsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX11-NEXT:    v_cmp_lt_u64_e64 s4, s[0:1], s[4:5]
 ; GFX11-NEXT:    v_cndmask_b32_e32 v1, v3, v2, vcc_lo
 ; GFX11-NEXT:    s_subb_u32 s3, s7, s15
-; GFX11-NEXT:    v_dual_mov_b32 v6, s1 :: v_dual_mov_b32 v7, s3
+; GFX11-NEXT:    v_dual_mov_b32 v6, s2 :: v_dual_mov_b32 v7, s3
 ; GFX11-NEXT:    s_cmp_eq_u64 s[2:3], s[6:7]
 ; GFX11-NEXT:    v_xor_b32_e32 v0, v1, v0
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s4
@@ -6335,34 +6335,34 @@ define amdgpu_ps <2 x i128> @s_ssubsat_v2i128(<2 x i128> inreg %lhs, <2 x i128>
 ; GFX11-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX11-NEXT:    s_and_b32 s5, 1, s5
 ; GFX11-NEXT:    s_ashr_i32 s4, s3, 31
-; GFX11-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s6
+; GFX11-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc_lo
 ; GFX11-NEXT:    v_cmp_ne_u32_e64 vcc_lo, 0, s5
 ; GFX11-NEXT:    s_add_i32 s0, s4, 0x80000000
 ; GFX11-NEXT:    v_dual_cndmask_b32 v2, v4, v3 :: v_dual_mov_b32 v3, s18
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
-; GFX11-NEXT:    v_mov_b32_e32 v0, s16
+; GFX11-NEXT:    v_mov_b32_e32 v4, s17
 ; GFX11-NEXT:    v_xor_b32_e32 v1, v2, v1
-; GFX11-NEXT:    v_mov_b32_e32 v4, s19
-; GFX11-NEXT:    v_mov_b32_e32 v2, s17
+; GFX11-NEXT:    v_mov_b32_e32 v0, s19
+; GFX11-NEXT:    v_mov_b32_e32 v2, s16
 ; GFX11-NEXT:    v_cndmask_b32_e64 v3, v3, s8, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s8, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v4, v4, s9, vcc_lo
 ; GFX11-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX11-NEXT:    v_cndmask_b32_e64 v4, v4, s8, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s9, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v0, s8, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v2, v2, s8, vcc_lo
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v4
 ; GFX11-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v1
-; GFX11-NEXT:    v_mov_b32_e32 v1, s2
-; GFX11-NEXT:    v_readfirstlane_b32 s1, v4
-; GFX11-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX11-NEXT:    v_readfirstlane_b32 s3, v2
+; GFX11-NEXT:    v_mov_b32_e32 v1, s1
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
 ; GFX11-NEXT:    v_cndmask_b32_e64 v5, v5, s4, vcc_lo
-; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v1, s4, vcc_lo
+; GFX11-NEXT:    v_cndmask_b32_e64 v6, v6, s4, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v7, v7, s0, vcc_lo
 ; GFX11-NEXT:    v_readfirstlane_b32 s0, v3
 ; GFX11-NEXT:    v_readfirstlane_b32 s4, v5
-; GFX11-NEXT:    v_readfirstlane_b32 s5, v6
-; GFX11-NEXT:    v_readfirstlane_b32 s6, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s5, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s6, v6
 ; GFX11-NEXT:    v_readfirstlane_b32 s7, v7
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call <2 x i128> @llvm.ssub.sat.v2i128(<2 x i128> %lhs, <2 x i128> %rhs)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
index 018e5fb6ee3b8..1a6d26142208f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
@@ -1937,39 +1937,39 @@ define <2 x i64> @v_udiv_v2i64_24bit(<2 x i64> %num, <2 x i64> %den) {
 ; GISEL-NEXT:    v_mul_lo_u32 v18, v10, v7
 ; GISEL-NEXT:    v_mul_hi_u32 v19, v9, v7
 ; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v15, v6
-; GISEL-NEXT:    v_add_i32_e32 v13, vcc, v18, v13
-; GISEL-NEXT:    v_mul_lo_u32 v15, v11, v17
-; GISEL-NEXT:    v_mul_hi_u32 v18, v7, v17
-; GISEL-NEXT:    v_add_i32_e32 v13, vcc, v13, v19
-; GISEL-NEXT:    v_mul_lo_u32 v19, v7, v13
-; GISEL-NEXT:    v_add_i32_e32 v15, vcc, v15, v19
-; GISEL-NEXT:    v_cndmask_b32_e64 v19, 0, 1, vcc
-; GISEL-NEXT:    v_add_i32_e32 v15, vcc, v15, v18
 ; GISEL-NEXT:    v_mul_lo_u32 v15, v8, v14
+; GISEL-NEXT:    v_add_i32_e32 v13, vcc, v18, v13
 ; GISEL-NEXT:    v_mul_hi_u32 v18, v12, v14
 ; GISEL-NEXT:    v_mul_hi_u32 v14, v8, v14
-; GISEL-NEXT:    v_mul_hi_u32 v17, v11, v17
-; GISEL-NEXT:    v_add_i32_e64 v16, s[4:5], v6, v16
+; GISEL-NEXT:    v_add_i32_e32 v16, vcc, v6, v16
 ; GISEL-NEXT:    v_mul_lo_u32 v6, v12, v16
-; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v15, v6
-; GISEL-NEXT:    v_cndmask_b32_e64 v15, 0, 1, s[4:5]
+; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v15, v6
+; GISEL-NEXT:    v_cndmask_b32_e64 v15, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v6, v18
+; GISEL-NEXT:    v_mul_lo_u32 v6, v11, v17
+; GISEL-NEXT:    v_mul_hi_u32 v18, v7, v17
+; GISEL-NEXT:    v_mul_hi_u32 v17, v11, v17
+; GISEL-NEXT:    v_add_i32_e64 v13, s[4:5], v13, v19
+; GISEL-NEXT:    v_mul_lo_u32 v19, v7, v13
+; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v6, v19
+; GISEL-NEXT:    v_cndmask_b32_e64 v19, 0, 1, s[4:5]
 ; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v6, v18
 ; GISEL-NEXT:    v_mul_lo_u32 v6, v8, v16
-; GISEL-NEXT:    v_cndmask_b32_e64 v18, 0, 1, s[4:5]
-; GISEL-NEXT:    v_add_i32_e64 v15, s[4:5], v15, v18
+; GISEL-NEXT:    v_mul_lo_u32 v18, v11, v13
+; GISEL-NEXT:    v_add_i32_e64 v17, s[6:7], v18, v17
+; GISEL-NEXT:    v_cndmask_b32_e64 v18, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v15, vcc, v15, v18
 ; GISEL-NEXT:    v_mul_hi_u32 v18, v12, v16
-; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v6, v14
-; GISEL-NEXT:    v_cndmask_b32_e64 v14, 0, 1, s[4:5]
-; GISEL-NEXT:    v_add_i32_e64 v18, s[4:5], v6, v18
-; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
-; GISEL-NEXT:    v_add_i32_e64 v14, s[4:5], v14, v6
+; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v6, v14
+; GISEL-NEXT:    v_cndmask_b32_e64 v14, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v18, vcc, v6, v18
 ; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v14, vcc, v14, v6
+; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
 ; GISEL-NEXT:    v_add_i32_e32 v19, vcc, v19, v6
-; GISEL-NEXT:    v_mul_lo_u32 v6, v11, v13
-; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v6, v17
-; GISEL-NEXT:    v_mul_hi_u32 v17, v7, v13
-; GISEL-NEXT:    v_cndmask_b32_e64 v20, 0, 1, vcc
-; GISEL-NEXT:    v_add_i32_e32 v17, vcc, v6, v17
+; GISEL-NEXT:    v_mul_hi_u32 v6, v7, v13
+; GISEL-NEXT:    v_cndmask_b32_e64 v20, 0, 1, s[6:7]
+; GISEL-NEXT:    v_add_i32_e32 v17, vcc, v17, v6
 ; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, vcc
 ; GISEL-NEXT:    v_add_i32_e32 v20, vcc, v20, v6
 ; GISEL-NEXT:    v_and_b32_e32 v6, 0xffffff, v0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
index ff0114cfc3ddb..1aaf3122cc00d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
@@ -506,17 +506,17 @@ define amdgpu_kernel void @udivrem_i64(ptr addrspace(1) %out0, ptr addrspace(1)
 ; GFX10-NEXT:    v_sub_co_u32 v10, s0, v6, s18
 ; GFX10-NEXT:    v_subrev_co_ci_u32_e64 v0, s0, 0, v0, s0
 ; GFX10-NEXT:    v_cndmask_b32_e32 v2, v2, v13, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v4, v14, vcc_lo
 ; GFX10-NEXT:    v_cmp_ne_u32_e64 s0, 0, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v6, v10, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v9, v0, vcc_lo
-; GFX10-NEXT:    v_mov_b32_e32 v10, 0
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v4, v14, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v4, v6, v10, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v6, v9, v0, vcc_lo
+; GFX10-NEXT:    v_mov_b32_e32 v9, 0
 ; GFX10-NEXT:    v_cndmask_b32_e64 v0, v5, v2, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v3, v4, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v7, v6, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v8, v9, s0
-; GFX10-NEXT:    global_store_dwordx2 v10, v[0:1], s[12:13]
-; GFX10-NEXT:    global_store_dwordx2 v10, v[2:3], s[14:15]
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v3, v1, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v7, v4, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v8, v6, s0
+; GFX10-NEXT:    global_store_dwordx2 v9, v[0:1], s[12:13]
+; GFX10-NEXT:    global_store_dwordx2 v9, v[2:3], s[14:15]
 ; GFX10-NEXT:    s_endpgm
   %div = udiv i64 %x, %y
   store i64 %div, ptr addrspace(1) %out0
@@ -663,24 +663,24 @@ define amdgpu_kernel void @udivrem_v2i32(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, s16, v2
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v3, s17, v3
+; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s18, v2
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v6, s18, v2
-; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s19, v3
-; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s18, v2
+; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s19, v3
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s19, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, v6, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v3, v7, vcc_lo
-; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v2, v6, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v5, s0
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v4, 1, v0
-; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s18, v2
-; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s19, v3
+; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s18, v2
+; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
+; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s19, v3
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v6, s18, v2
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s19, v3
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v0, v4, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, v6, s0
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v3, v7, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v1, v5, s0
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v2, v6, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v3, v7, s0
 ; GFX10-NEXT:    global_store_dwordx2 v8, v[0:1], s[12:13]
 ; GFX10-NEXT:    global_store_dwordx2 v8, v[2:3], s[14:15]
 ; GFX10-NEXT:    s_endpgm
@@ -1532,10 +1532,12 @@ define amdgpu_kernel void @udivrem_v2i64(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_cvt_f32_u32_e32 v1, s7
 ; GFX10-NEXT:    v_cvt_f32_u32_e32 v2, s4
 ; GFX10-NEXT:    v_cvt_f32_u32_e32 v3, s6
-; GFX10-NEXT:    s_sub_u32 s0, 0, s4
+; GFX10-NEXT:    s_sub_u32 s1, 0, s4
 ; GFX10-NEXT:    v_mul_f32_e32 v0, 0x4f800000, v0
 ; GFX10-NEXT:    v_mul_f32_e32 v1, 0x4f800000, v1
-; GFX10-NEXT:    s_subb_u32 s1, 0, s5
+; GFX10-NEXT:    s_subb_u32 s2, 0, s5
+; GFX10-NEXT:    s_sub_u32 s3, 0, s6
+; GFX10-NEXT:    s_subb_u32 s10, 0, s7
 ; GFX10-NEXT:    v_add_f32_e32 v0, v0, v2
 ; GFX10-NEXT:    v_add_f32_e32 v1, v1, v3
 ; GFX10-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -1554,17 +1556,15 @@ define amdgpu_kernel void @udivrem_v2i64(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_add_f32_e32 v1, v3, v1
 ; GFX10-NEXT:    v_cvt_u32_f32_e32 v7, v0
 ; GFX10-NEXT:    v_cvt_u32_f32_e32 v8, v1
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, s0, v7, 0
-; GFX10-NEXT:    s_sub_u32 s2, 0, s6
-; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s3, s2, v8, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s0, s1, v7, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s0, s3, v8, 0
 ; GFX10-NEXT:    v_mul_hi_u32 v11, v9, v0
-; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s3, s0, v9, v[1:2]
-; GFX10-NEXT:    v_mad_u64_u32 v[5:6], s3, s2, v10, v[3:4]
+; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s0, s1, v9, v[1:2]
+; GFX10-NEXT:    v_mad_u64_u32 v[5:6], s0, s3, v10, v[3:4]
 ; GFX10-NEXT:    v_mul_lo_u32 v6, v9, v0
-; GFX10-NEXT:    s_subb_u32 s3, 0, s7
-; GFX10-NEXT:    v_mad_u64_u32 v[3:4], s10, s1, v7, v[4:5]
+; GFX10-NEXT:    v_mad_u64_u32 v[3:4], s0, s2, v7, v[4:5]
 ; GFX10-NEXT:    v_mul_hi_u32 v4, v7, v0
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s10, s3, v8, v[5:6]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s0, s10, v8, v[5:6]
 ; GFX10-NEXT:    v_mul_lo_u32 v1, v10, v2
 ; GFX10-NEXT:    v_mul_hi_u32 v5, v8, v2
 ; GFX10-NEXT:    v_mul_hi_u32 v2, v10, v2
@@ -1576,45 +1576,45 @@ define amdgpu_kernel void @udivrem_v2i64(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_mul_hi_u32 v17, v8, v0
 ; GFX10-NEXT:    v_mul_hi_u32 v3, v9, v3
 ; GFX10-NEXT:    v_mul_hi_u32 v0, v10, v0
-; GFX10-NEXT:    v_add_co_u32 v6, s10, v6, v12
-; GFX10-NEXT:    v_cndmask_b32_e64 v12, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v11, s10, v13, v11
-; GFX10-NEXT:    v_cndmask_b32_e64 v13, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v1, s10, v1, v15
-; GFX10-NEXT:    v_cndmask_b32_e64 v15, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v2, s10, v16, v2
-; GFX10-NEXT:    v_cndmask_b32_e64 v16, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v4, s10, v6, v4
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v6, s10, v11, v14
-; GFX10-NEXT:    v_cndmask_b32_e64 v11, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v1, s10, v1, v5
+; GFX10-NEXT:    v_add_co_u32 v6, s0, v6, v12
+; GFX10-NEXT:    v_cndmask_b32_e64 v12, 0, 1, s0
+; GFX10-NEXT:    v_add_co_u32 v11, s0, v13, v11
+; GFX10-NEXT:    v_cndmask_b32_e64 v13, 0, 1, s0
+; GFX10-NEXT:    v_add_co_u32 v1, s0, v1, v15
+; GFX10-NEXT:    v_cndmask_b32_e64 v15, 0, 1, s0
+; GFX10-NEXT:    v_add_co_u32 v2, s0, v16, v2
+; GFX10-NEXT:    v_cndmask_b32_e64 v16, 0, 1, s0
+; GFX10-NEXT:    v_add_co_u32 v4, s0, v6, v4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s0
+; GFX10-NEXT:    v_add_co_u32 v6, s0, v11, v14
+; GFX10-NEXT:    v_cndmask_b32_e64 v11, 0, 1, s0
+; GFX10-NEXT:    v_add_co_u32 v1, s0, v1, v5
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v4, v12, v4
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v2, s10, v2, v17
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, 0, 1, s10
-; GFX10-NEXT:    v_add_co_u32 v4, s10, v6, v4
+; GFX10-NEXT:    v_add_co_u32 v2, s0, v2, v17
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, 0, 1, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, v15, v1
+; GFX10-NEXT:    v_add_co_u32 v4, s0, v6, v4
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v11, v13, v11
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s10
+; GFX10-NEXT:    v_add_co_u32 v1, s0, v2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v5, v16, v5
-; GFX10-NEXT:    v_add_co_u32 v7, vcc_lo, v7, v4
-; GFX10-NEXT:    v_add_co_u32 v1, s10, v2, v1
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s0
 ; GFX10-NEXT:    v_add3_u32 v3, v11, v6, v3
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s10
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v9, vcc_lo, v9, v3, vcc_lo
+; GFX10-NEXT:    v_add_co_u32 v7, vcc_lo, v7, v4
+; GFX10-NEXT:    v_add_co_u32 v8, s0, v8, v1
 ; GFX10-NEXT:    v_add3_u32 v2, v5, v2, v0
-; GFX10-NEXT:    v_add_co_u32 v8, vcc_lo, v8, v1
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s10, s0, v7, 0
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v10, vcc_lo, v10, v2, vcc_lo
-; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s10, s2, v8, 0
+; GFX10-NEXT:    v_add_co_ci_u32_e32 v9, vcc_lo, v9, v3, vcc_lo
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s11, s1, v7, 0
+; GFX10-NEXT:    v_add_co_ci_u32_e64 v10, vcc_lo, v10, v2, s0
+; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s0, s3, v8, 0
 ; GFX10-NEXT:    v_mul_hi_u32 v11, v9, v0
-; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s0, s0, v9, v[1:2]
-; GFX10-NEXT:    v_mad_u64_u32 v[5:6], s0, s2, v10, v[3:4]
+; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s0, s1, v9, v[1:2]
+; GFX10-NEXT:    v_mad_u64_u32 v[5:6], s0, s3, v10, v[3:4]
 ; GFX10-NEXT:    v_mul_lo_u32 v6, v9, v0
-; GFX10-NEXT:    v_mad_u64_u32 v[3:4], s0, s1, v7, v[4:5]
+; GFX10-NEXT:    v_mad_u64_u32 v[3:4], s0, s2, v7, v[4:5]
 ; GFX10-NEXT:    v_mul_hi_u32 v4, v7, v0
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s0, s3, v8, v[5:6]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s0, s10, v8, v[5:6]
 ; GFX10-NEXT:    v_mul_lo_u32 v1, v10, v2
 ; GFX10-NEXT:    v_mul_hi_u32 v5, v8, v2
 ; GFX10-NEXT:    v_mul_hi_u32 v2, v10, v2
@@ -1652,13 +1652,13 @@ define amdgpu_kernel void @udivrem_v2i64(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s0
 ; GFX10-NEXT:    v_add3_u32 v3, v11, v6, v3
 ; GFX10-NEXT:    v_add_co_u32 v4, vcc_lo, v7, v4
+; GFX10-NEXT:    v_add_co_u32 v1, s0, v8, v1
 ; GFX10-NEXT:    v_add3_u32 v0, v5, v2, v0
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v2, vcc_lo, v9, v3, vcc_lo
-; GFX10-NEXT:    v_add_co_u32 v1, vcc_lo, v8, v1
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v0, vcc_lo, v10, v0, vcc_lo
 ; GFX10-NEXT:    v_mul_lo_u32 v3, s17, v4
-; GFX10-NEXT:    v_mul_lo_u32 v8, s16, v2
 ; GFX10-NEXT:    v_mul_hi_u32 v5, s16, v4
+; GFX10-NEXT:    v_add_co_ci_u32_e64 v0, vcc_lo, v10, v0, s0
+; GFX10-NEXT:    v_mul_lo_u32 v8, s16, v2
 ; GFX10-NEXT:    v_mul_hi_u32 v4, s17, v4
 ; GFX10-NEXT:    v_mul_lo_u32 v9, s17, v2
 ; GFX10-NEXT:    v_mul_lo_u32 v6, s19, v1
@@ -2063,8 +2063,8 @@ define amdgpu_kernel void @udivrem_v2i8(ptr addrspace(1) %out0, ptr addrspace(1)
 ; GFX10-NEXT:    v_add_nc_u32_e32 v5, 1, v1
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, s3, v2
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v3, s0, v3
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v6, s2, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s2, v2
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v6, s2, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s1, v3
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v7, s1, v3
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
@@ -2377,8 +2377,8 @@ define amdgpu_kernel void @udivrem_v2i16(ptr addrspace(1) %out0, ptr addrspace(1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v6, 1, v1
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v2, s3, v2
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v3, s0, v3
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s2, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s2, v2
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v5, s2, v2
 ; GFX10-NEXT:    v_cmp_le_u32_e64 s0, s1, v3
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc_lo
 ; GFX10-NEXT:    v_subrev_nc_u32_e32 v4, s1, v3
@@ -2508,8 +2508,8 @@ define amdgpu_kernel void @udivrem_i3(ptr addrspace(1) %out0, ptr addrspace(1) %
 ; GFX10-NEXT:    v_add_nc_u32_e32 v2, 1, v0
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v1, s0, v1
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v3, s4, v1
 ; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s4, v1
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v3, s4, v1
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc_lo
 ; GFX10-NEXT:    v_add_nc_u32_e32 v2, 1, v0
@@ -2629,8 +2629,8 @@ define amdgpu_kernel void @udivrem_i27(ptr addrspace(1) %out0, ptr addrspace(1)
 ; GFX10-NEXT:    v_add_nc_u32_e32 v2, 1, v0
 ; GFX10-NEXT:    v_sub_nc_u32_e32 v1, s0, v1
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GFX10-NEXT:    v_subrev_nc_u32_e32 v3, s4, v1
 ; GFX10-NEXT:    v_cmp_le_u32_e32 vcc_lo, s4, v1
+; GFX10-NEXT:    v_subrev_nc_u32_e32 v3, s4, v1
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc_lo
 ; GFX10-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc_lo
 ; GFX10-NEXT:    v_add_nc_u32_e32 v2, 1, v0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
index 51d5253f87920..f6a228614a27e 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
@@ -2352,39 +2352,39 @@ define <2 x i64> @v_urem_v2i64_24bit(<2 x i64> %num, <2 x i64> %den) {
 ; GISEL-NEXT:    v_mul_lo_u32 v18, v10, v7
 ; GISEL-NEXT:    v_mul_hi_u32 v19, v9, v7
 ; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v15, v6
-; GISEL-NEXT:    v_add_i32_e32 v13, vcc, v18, v13
-; GISEL-NEXT:    v_mul_lo_u32 v15, v11, v17
-; GISEL-NEXT:    v_mul_hi_u32 v18, v7, v17
-; GISEL-NEXT:    v_add_i32_e32 v13, vcc, v13, v19
-; GISEL-NEXT:    v_mul_lo_u32 v19, v7, v13
-; GISEL-NEXT:    v_add_i32_e32 v15, vcc, v15, v19
-; GISEL-NEXT:    v_cndmask_b32_e64 v19, 0, 1, vcc
-; GISEL-NEXT:    v_add_i32_e32 v15, vcc, v15, v18
 ; GISEL-NEXT:    v_mul_lo_u32 v15, v8, v14
+; GISEL-NEXT:    v_add_i32_e32 v13, vcc, v18, v13
 ; GISEL-NEXT:    v_mul_hi_u32 v18, v12, v14
 ; GISEL-NEXT:    v_mul_hi_u32 v14, v8, v14
-; GISEL-NEXT:    v_mul_hi_u32 v17, v11, v17
-; GISEL-NEXT:    v_add_i32_e64 v16, s[4:5], v6, v16
+; GISEL-NEXT:    v_add_i32_e32 v16, vcc, v6, v16
 ; GISEL-NEXT:    v_mul_lo_u32 v6, v12, v16
-; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v15, v6
-; GISEL-NEXT:    v_cndmask_b32_e64 v15, 0, 1, s[4:5]
+; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v15, v6
+; GISEL-NEXT:    v_cndmask_b32_e64 v15, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v6, v18
+; GISEL-NEXT:    v_mul_lo_u32 v6, v11, v17
+; GISEL-NEXT:    v_mul_hi_u32 v18, v7, v17
+; GISEL-NEXT:    v_mul_hi_u32 v17, v11, v17
+; GISEL-NEXT:    v_add_i32_e64 v13, s[4:5], v13, v19
+; GISEL-NEXT:    v_mul_lo_u32 v19, v7, v13
+; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v6, v19
+; GISEL-NEXT:    v_cndmask_b32_e64 v19, 0, 1, s[4:5]
 ; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v6, v18
 ; GISEL-NEXT:    v_mul_lo_u32 v6, v8, v16
-; GISEL-NEXT:    v_cndmask_b32_e64 v18, 0, 1, s[4:5]
-; GISEL-NEXT:    v_add_i32_e64 v15, s[4:5], v15, v18
+; GISEL-NEXT:    v_mul_lo_u32 v18, v11, v13
+; GISEL-NEXT:    v_add_i32_e64 v17, s[6:7], v18, v17
+; GISEL-NEXT:    v_cndmask_b32_e64 v18, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v15, vcc, v15, v18
 ; GISEL-NEXT:    v_mul_hi_u32 v18, v12, v16
-; GISEL-NEXT:    v_add_i32_e64 v6, s[4:5], v6, v14
-; GISEL-NEXT:    v_cndmask_b32_e64 v14, 0, 1, s[4:5]
-; GISEL-NEXT:    v_add_i32_e64 v18, s[4:5], v6, v18
-; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
-; GISEL-NEXT:    v_add_i32_e64 v14, s[4:5], v14, v6
+; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v6, v14
+; GISEL-NEXT:    v_cndmask_b32_e64 v14, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v18, vcc, v6, v18
 ; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, vcc
+; GISEL-NEXT:    v_add_i32_e32 v14, vcc, v14, v6
+; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[4:5]
 ; GISEL-NEXT:    v_add_i32_e32 v19, vcc, v19, v6
-; GISEL-NEXT:    v_mul_lo_u32 v6, v11, v13
-; GISEL-NEXT:    v_add_i32_e32 v6, vcc, v6, v17
-; GISEL-NEXT:    v_mul_hi_u32 v17, v7, v13
-; GISEL-NEXT:    v_cndmask_b32_e64 v20, 0, 1, vcc
-; GISEL-NEXT:    v_add_i32_e32 v17, vcc, v6, v17
+; GISEL-NEXT:    v_mul_hi_u32 v6, v7, v13
+; GISEL-NEXT:    v_cndmask_b32_e64 v20, 0, 1, s[6:7]
+; GISEL-NEXT:    v_add_i32_e32 v17, vcc, v17, v6
 ; GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, vcc
 ; GISEL-NEXT:    v_add_i32_e32 v20, vcc, v20, v6
 ; GISEL-NEXT:    v_and_b32_e32 v6, 0xffffff, v0
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
index 2d19f9702e6ba..701b752e4aa74 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
@@ -1268,12 +1268,12 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(5)
 ; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(4)
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr51
@@ -1701,7 +1701,6 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, 3, v58
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, 3, v59
@@ -2490,8 +2489,8 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    ; kill: killed $vgpr39
@@ -3146,13 +3145,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 40, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v12, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 44, v0
@@ -3172,13 +3172,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v14, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 52, v0
@@ -3198,13 +3199,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 56, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v16, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 60, v0
@@ -3224,13 +3226,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 64, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v18, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x44, v0
@@ -3250,13 +3253,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x4c, v0
@@ -3276,13 +3280,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x50, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x54, v0
@@ -3302,13 +3307,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x58, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x5c, v0
@@ -3328,13 +3334,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x60, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x64, v0
@@ -3352,13 +3359,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x68, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x6c, v0
@@ -3378,13 +3386,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x70, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x74, v0
@@ -3450,8 +3459,8 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    ; implicit-def: $vgpr40
 ; GFX9-NEXT:    ; kill: killed $vgpr40
@@ -3641,7 +3650,7 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(30)
+; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    ; kill: killed $vgpr33
@@ -3650,7 +3659,6 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB6_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
-; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 24, v32
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
@@ -3846,7 +3854,6 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB6_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
-; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_add_u32_e32 v32, 3, v32
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
 ; GFX9-NEXT:    v_add_u32_e32 v31, 3, v31
@@ -4263,13 +4270,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:72
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:76
@@ -4289,13 +4297,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:80
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:84
@@ -4315,13 +4324,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:88
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:92
@@ -4341,13 +4351,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:96
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:100
@@ -4367,13 +4378,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:104
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:108
@@ -4393,13 +4405,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:112
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:116
@@ -4419,13 +4432,14 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:120
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v32, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:124
@@ -5738,154 +5752,147 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v55, v0
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:92
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:388
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:60
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:20
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:16
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v48, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:388
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:116
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v12
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 24, v10
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v5
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v20
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v20
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v18
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v7
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
@@ -5893,295 +5900,304 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:140
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:148
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:168
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:164
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:164
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:172
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:200
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:196
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:204
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:216
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:244
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:276
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:308
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:340
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:368
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:380
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:384
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:368
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:376
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 24, v0
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v1
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:372
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:76
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:12
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:372
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:364
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 24, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v4
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB7_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v39
+; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v55
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v38
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v56
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v48
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v3, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v38
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v56
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v47
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v54
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v54
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v47
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v46
@@ -6189,101 +6205,99 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v37
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v53
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v43
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v36
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v45
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
+; GCN-NEXT:    v_or_b32_e32 v8, v8, v37
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v53
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v45
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v42
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v40
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v52
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v63
@@ -6291,217 +6305,221 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v50
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v58
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v51
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v39
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v49
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v60
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v51
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v58
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v50
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v62
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v61
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v59
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v61
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v60
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v34
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v52
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v62
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v35
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v33
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v44
-; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v57
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v43
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v50, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v51, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v52, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v36, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v36, v37, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v37
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v37, v38, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v38, v39, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xff, v39
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v48, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v49, v39
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v48, v55, v48
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v48, v49, v48
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xff, v49
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v49, v54, v49
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
@@ -6532,6 +6550,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v39
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v48
 ; GCN-NEXT:    v_or_b32_e32 v31, v31, v49
+; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -6564,13 +6583,14 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -6655,37 +6675,34 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr42
@@ -6694,13 +6711,13 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -6715,19 +6732,20 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr59
@@ -6736,297 +6754,298 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; kill: killed $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; kill: killed $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:  .LBB7_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB7_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v0
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v39, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v38, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v56, v1
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v48, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v38, v2
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v56, v2
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v47, v3
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v54, v3
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v54, v4
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v47, v4
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v46, v5
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v37, v6
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v6, v53, v6
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v43, v7
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v48
+; GCN-NEXT:    v_or_b32_e32 v7, v36, v7
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v45, v8
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v37, v8
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v53, v9
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v45, v9
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v42, v10
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v41, v11
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v40, v12
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v52, v12
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v63, v13
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v50, v14
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v58, v14
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v0, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v0, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v0, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v51, v18
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v39, v18
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v49, v19
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v0, v19
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v60, v20
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v51, v20
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v58, v21
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v50, v21
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v25, v62, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v61, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v29, v59, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v37, v32, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v36, v32, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v50, v61, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v60, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v41, v34, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v45, v52, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v45, v62, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v56, v35, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v58, v33, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v59, v44, v22
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v57
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v57, v36, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v60, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v43, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v61, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v62, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v63, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v63, v0, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v38, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v37, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v38, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v49, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_mov_b32_e32 v0, v49
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v51, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v49, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v51, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v52, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v52, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v57
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v54, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v54, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v57, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
 ; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
@@ -7034,15 +7053,15 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v24
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v26, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v26
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v27, v26
 ; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
@@ -7050,151 +7069,150 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v28, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v30, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v31, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v35
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v48, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v48, vcc, 3, v48
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v48, v53, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, 3, v53
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_mov_b32_e32 v0, v55
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v55, v53
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v55
 ; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v55, v40, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v40
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v40, v42, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 3, v42
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xff, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v42, v43, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v43
 ; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v43, v44, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v44
 ; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v44, v46, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v0, v44
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v46, vcc, 3, v46
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xff, v46
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v46, v0, v46
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v47, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v47, v0, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v60, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v61, v0
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v61, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v62, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v62, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v63, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v63, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v22, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
@@ -7214,16 +7232,16 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, s7, v37
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v50
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v41
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v45
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v56
-; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v58
-; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v59
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, 0x300, v57
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, s7, v25
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v36
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, s7, v50
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v41
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v45
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v56
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v58
+; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v59
+; GCN-NEXT:    v_add_i32_e32 v59, vcc, 0x300, v60
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -7242,24 +7260,24 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v36
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff, v50
 ; GCN-NEXT:    v_and_b32_e32 v41, 0xffff, v41
 ; GCN-NEXT:    v_and_b32_e32 v45, 0xffff, v45
 ; GCN-NEXT:    v_and_b32_e32 v56, 0xffff, v56
 ; GCN-NEXT:    v_and_b32_e32 v58, 0xffff, v58
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xffff, v59
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff, v57
-; GCN-NEXT:    v_or_b32_e32 v4, v36, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v37, v4
 ; GCN-NEXT:    v_or_b32_e32 v5, v38, v5
 ; GCN-NEXT:    v_or_b32_e32 v6, v39, v6
 ; GCN-NEXT:    v_or_b32_e32 v7, v49, v7
 ; GCN-NEXT:    v_or_b32_e32 v8, v51, v8
 ; GCN-NEXT:    v_or_b32_e32 v9, v52, v9
 ; GCN-NEXT:    v_or_b32_e32 v10, v54, v10
-; GCN-NEXT:    v_or_b32_e32 v11, v22, v11
+; GCN-NEXT:    v_or_b32_e32 v11, v57, v11
 ; GCN-NEXT:    v_or_b32_e32 v12, v23, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v24, v13
 ; GCN-NEXT:    v_or_b32_e32 v14, v26, v14
@@ -7270,16 +7288,16 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
 ; GCN-NEXT:    v_or_b32_e32 v20, v33, v20
 ; GCN-NEXT:    v_or_b32_e32 v21, v34, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v35, v25
-; GCN-NEXT:    v_or_b32_e32 v23, v48, v29
-; GCN-NEXT:    v_or_b32_e32 v24, v53, v37
-; GCN-NEXT:    v_or_b32_e32 v25, v55, v50
-; GCN-NEXT:    v_or_b32_e32 v26, v40, v41
-; GCN-NEXT:    v_or_b32_e32 v27, v42, v45
-; GCN-NEXT:    v_or_b32_e32 v28, v43, v56
-; GCN-NEXT:    v_or_b32_e32 v29, v44, v58
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v59
-; GCN-NEXT:    v_or_b32_e32 v31, v47, v57
+; GCN-NEXT:    v_or_b32_e32 v22, v35, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v48, v25
+; GCN-NEXT:    v_or_b32_e32 v24, v53, v29
+; GCN-NEXT:    v_or_b32_e32 v25, v55, v36
+; GCN-NEXT:    v_or_b32_e32 v26, v40, v50
+; GCN-NEXT:    v_or_b32_e32 v27, v42, v41
+; GCN-NEXT:    v_or_b32_e32 v28, v43, v45
+; GCN-NEXT:    v_or_b32_e32 v29, v44, v56
+; GCN-NEXT:    v_or_b32_e32 v30, v46, v58
+; GCN-NEXT:    v_or_b32_e32 v31, v47, v59
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -7385,19 +7403,18 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -7429,59 +7446,61 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -7490,25 +7509,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -7516,25 +7535,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -7542,25 +7561,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -7568,10 +7587,10 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -7583,10 +7602,10 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -7597,22 +7616,22 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -7623,14 +7642,14 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -7642,34 +7661,33 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB7_2
 ; VI-NEXT:  ; %bb.1: ; %cmp.false
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -7677,19 +7695,19 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -7700,23 +7718,23 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr55
+; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr38
@@ -7757,147 +7775,217 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; kill: killed $vgpr33
 ; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr54
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -7907,16 +7995,16 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -7925,7 +8013,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -7934,253 +8022,190 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr33
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:  .LBB7_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB7_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v31, 0x300
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v61
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_add_u16_e32 v8, 3, v8
+; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v1, 3, v1
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v1
+; VI-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; VI-NEXT:    v_add_u16_sdwa v1, v1, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -8189,31 +8214,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v1, v2, v3
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(13)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(11)
+; VI-NEXT:    s_waitcnt vmcnt(10)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -8250,13 +8269,13 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v63
+; VI-NEXT:    v_add_u16_e32 v8, 3, v62
 ; VI-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v59
+; VI-NEXT:    v_add_u16_e32 v9, 3, v32
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v62
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
@@ -8264,27 +8283,28 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
 ; VI-NEXT:    v_add_u16_e32 v10, 3, v58
 ; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v60
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v57
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v46
+; VI-NEXT:    v_add_u16_e32 v12, 3, v47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -8295,148 +8315,149 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v54
+; VI-NEXT:    v_add_u16_e32 v14, 3, v42
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v40
-; VI-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v15
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v15, v15, v16
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v17, v17, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v16, v16, v17
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v18, v18, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v17, v17, v18
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v19, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v18, v18, v19
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v20, v20, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v19, v19, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v21, v21, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v20, v20, v21
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v22, v22, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v21, v21, v22
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v23, v23, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v23
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v24, v24, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v23, v23, v24
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
 ; VI-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v25, v25, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v24, v24, v25
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -8450,19 +8471,19 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v26, v26, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v25, v25, v26
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v26, 3, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
 ; VI-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v27, v27, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v27
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -8476,20 +8497,20 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v28, v28, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v28
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v28, 3, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v29, v29, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v29
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -8502,7 +8523,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v30
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v30, 3, v30
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -8515,7 +8536,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v30, v30, v32
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v32, 3, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -8524,7 +8545,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v32, 0x300, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v33, 3, v33
-; VI-NEXT:    v_or_b32_sdwa v33, v34, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v33, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v31, v33, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v31, v32, v31
 ; VI-NEXT:  .LBB7_4: ; %end
@@ -8601,19 +8622,18 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -8625,93 +8645,95 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v29
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
@@ -8721,25 +8743,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -8748,25 +8770,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -8775,25 +8797,25 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -8802,10 +8824,10 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -8817,10 +8839,10 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -8832,22 +8854,22 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -8859,14 +8881,14 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -8878,34 +8900,33 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB7_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -8913,20 +8934,20 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -8937,23 +8958,23 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr38
@@ -8994,147 +9015,217 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; kill: killed $vgpr33
 ; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -9144,16 +9235,16 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -9162,7 +9253,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -9171,258 +9262,196 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr33
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:  .LBB7_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB7_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(13)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(33)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
+; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(23)
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_add_u16_sdwa v1, v1, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -9431,32 +9460,26 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_e32 v1, v2, v3
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v4, v4, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v5, 3, v5
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(12)
+; GFX9-NEXT:    s_waitcnt vmcnt(11)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -9493,41 +9516,41 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v32
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v57
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -9538,148 +9561,149 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v54
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v40
-; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v17, v17, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v16, v16, v17
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v18, v18, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v17, v17, v18
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v19, v19, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v18, v18, v19
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v20, v20, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v19, v19, v20
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v21, v21, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v20, v20, v21
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v22, v22, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v21, v21, v22
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v23, v23, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v22, v22, v23
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v24, v24, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v23, v23, v24
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v25, v25, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v24, v24, v25
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -9693,19 +9717,19 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v26, v26, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v25, v25, v26
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v26, 3, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v27, v27, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v26, v26, v27
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -9719,20 +9743,20 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v28, v28, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v27, v27, v28
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v28, 3, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v29, v29, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v28, v28, v29
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -9745,7 +9769,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v30, v30, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v29, v29, v30
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v30, 3, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -9758,7 +9782,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v31, v31, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v30, v30, v31
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -9767,7 +9791,7 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v32, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v32, v33, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v32, v63, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v32, v32, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v31, v31, v32
 ; GFX9-NEXT:  .LBB7_4: ; %end
@@ -11807,11 +11831,11 @@ define <64 x bfloat> @bitcast_v32i32_to_v64bf16(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v32
 ; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -11914,7 +11938,7 @@ define <64 x bfloat> @bitcast_v32i32_to_v64bf16(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB8_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1) expcnt(0)
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v62
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v62
 ; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
@@ -12680,29 +12704,27 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:72
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32
 ; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v1
 ; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v0
 ; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v3
@@ -12713,68 +12735,62 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v46, 1.0, v9
 ; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v8
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v11
 ; GCN-NEXT:    v_mul_f32_e32 v45, 1.0, v10
-; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -12784,69 +12800,79 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
 ; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v33
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v5
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v42
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v44
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v55
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v40
+; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v51
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v52
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v48
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v50
-; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v50
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v38
+; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v44
+; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v63
 ; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
-; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v63
+; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v42
 ; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
-; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v43
+; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v55
 ; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
-; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v41
+; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v54
 ; GCN-NEXT:    v_mul_f32_e32 v49, 1.0, v49
-; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v54
+; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v31
-; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v54, 1.0, v0
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v1
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v2
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v0
+; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -12863,91 +12889,87 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v3, v3, v57, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v46
 ; GCN-NEXT:    v_alignbit_b32 v4, v4, v47, 16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v43
 ; GCN-NEXT:    v_alignbit_b32 v5, v5, v45, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v6, v6, v7, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v6, v6, v41, 16
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v7, v7, v8, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v8, v8, v9, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v9, v9, v10, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v10, v10, v11, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v11, v11, v12, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v12, v12, v13, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v13, v13, v14, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v14, v14, v15, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v34
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v15, v15, v16, 16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v16, v16, v33, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v17, v17, v18, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
 ; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v36
@@ -12955,30 +12977,34 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v39
 ; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v49
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v52
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v54
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v41
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v43
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v54
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v55
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v42
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v44
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v18, v18, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v19, v19, v42, 16
-; GCN-NEXT:    v_alignbit_b32 v20, v20, v44, 16
+; GCN-NEXT:    v_alignbit_b32 v19, v19, v40, 16
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v20, v20, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v21, v21, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v22, v22, v48, 16
-; GCN-NEXT:    v_alignbit_b32 v23, v23, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v22, v22, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v23, v23, v48, 16
 ; GCN-NEXT:    v_alignbit_b32 v24, v24, v50, 16
 ; GCN-NEXT:    v_alignbit_b32 v25, v25, v51, 16
-; GCN-NEXT:    v_alignbit_b32 v26, v26, v53, 16
-; GCN-NEXT:    v_alignbit_b32 v27, v27, v55, 16
-; GCN-NEXT:    v_alignbit_b32 v28, v28, v40, 16
-; GCN-NEXT:    v_alignbit_b32 v29, v29, v63, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v26, v26, v52, 16
+; GCN-NEXT:    v_alignbit_b32 v27, v27, v53, 16
+; GCN-NEXT:    v_alignbit_b32 v28, v28, v63, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v29, v29, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v30, v30, v32, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v31, v31, v32, 16
 ; GCN-NEXT:    ; implicit-def: $vgpr62
@@ -12991,13 +13017,11 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -13046,10 +13070,11 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
@@ -13057,26 +13082,27 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; kill: killed $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; kill: killed $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:  .LBB9_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB9_4
@@ -13113,104 +13139,100 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_alignbit_b32 v4, v5, v4, 16
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v43
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_alignbit_b32 v5, v6, v5, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v41
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_alignbit_b32 v6, v7, v6, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_alignbit_b32 v7, v8, v7, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_alignbit_b32 v8, v9, v8, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_alignbit_b32 v9, v10, v9, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_alignbit_b32 v10, v11, v10, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_alignbit_b32 v11, v12, v11, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_alignbit_b32 v12, v13, v12, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_alignbit_b32 v13, v14, v13, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_alignbit_b32 v14, v15, v14, 16
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v34
@@ -13219,69 +13241,73 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v15, v16, v15, 16
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v33
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_alignbit_b32 v16, v17, v16, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_alignbit_b32 v17, v18, v17, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v42
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v40
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v44
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v48
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v38
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v38
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v48
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v36
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v50
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v35
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v51
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff0000, v37
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v53
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v52
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v39
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v55
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v53
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v49
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v40
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v63
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v54
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v54
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v48
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v55
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v50
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v41
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v42
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v43
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v44
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v20
@@ -14974,43 +15000,42 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v37, v18, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v20, 0xffff, v20, v32
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v37, 0x40c00000, v38 :: v_dual_cndmask_b32 v34, v34, v36
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v36, 0x400000, v18
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v38, 16, v16
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v48, 0x400000, v37
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v19, 0xffff, v19, v33
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v18, v35, v36, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v38
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v38, v17, 16, 1
+; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v35, v37, 16, 1
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v39, v36, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v38, v17, 0x7fff
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v51, 0x400000, v36
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v35, v37, 0x7fff
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v39, v39, v36, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v17, v38, v49, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v16
-; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v18.l, v18.h
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v17.l, v17.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v36, v39, v51, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v18, 0xffff, v18, v34
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v36.l, v36.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v35, v35, v48, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v17, 0xffff, v17, v35
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v16, v38, v49, vcc_lo
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v16, 0xffff, v36, v16
 ; GFX11-TRUE16-NEXT:  .LBB9_2: ; %end
 ; GFX11-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s0
@@ -15171,15 +15196,15 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v34, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v37, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v33, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v34, 0x40c00000, v34
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v34, 0x40c00000, v34 :: v_dual_add_f32 v35, 0x40c00000, v37
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v35, 0x40c00000, v37
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v38, v34, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v32, v35, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v36, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
@@ -15528,17 +15553,16 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v39, 16, v16
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v18, v36, v37, vcc_lo
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v39
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v38, v35, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v38, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v39, v17, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v48, v36, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v49, 0x400000, v36
-; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v18, v18, v34, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v35, v37, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v38, v39, v17, 0x7fff
@@ -15546,7 +15570,7 @@ define <32 x i32> @bitcast_v64bf16_to_v32i32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v48, v48, v36, 0x7fff
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v38, v39, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v37, v16, 0x7fff
@@ -15598,12 +15622,12 @@ define <64 x half> @bitcast_v32i32_to_v64f16(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -15707,7 +15731,6 @@ define <64 x half> @bitcast_v32i32_to_v64f16(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB10_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v62
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:132 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
@@ -16539,28 +16562,26 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v0
@@ -16573,67 +16594,61 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v9
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v10
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v49
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -16643,28 +16658,27 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v48
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v4
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v43
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v41
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v41
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v55
@@ -16672,46 +16686,58 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v53
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v54
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v54
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v50
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v52
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v48
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v39
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v49
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v35
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v37
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v33
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v31
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v0
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v3
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v7
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB11_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v61
 ; GCN-NEXT:    v_or_b32_e32 v0, v62, v0
@@ -16723,123 +16749,123 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v47
 ; GCN-NEXT:    v_or_b32_e32 v4, v46, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v5, v44, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v44
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v38
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v32
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v32
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v38
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v43, v19
-; GCN-NEXT:    v_or_b32_e32 v20, v41, v20
-; GCN-NEXT:    v_or_b32_e32 v21, v55, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v49, v22
-; GCN-NEXT:    v_or_b32_e32 v23, v50, v23
-; GCN-NEXT:    v_or_b32_e32 v24, v39, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v41, v19
+; GCN-NEXT:    v_or_b32_e32 v20, v55, v20
+; GCN-NEXT:    v_or_b32_e32 v21, v53, v21
+; GCN-NEXT:    v_or_b32_e32 v22, v50, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v39, v23
+; GCN-NEXT:    v_or_b32_e32 v24, v35, v24
 ; GCN-NEXT:    v_or_b32_e32 v25, v36, v25
-; GCN-NEXT:    v_or_b32_e32 v26, v48, v26
+; GCN-NEXT:    v_or_b32_e32 v26, v49, v26
 ; GCN-NEXT:    v_or_b32_e32 v27, v52, v27
-; GCN-NEXT:    v_or_b32_e32 v28, v53, v28
-; GCN-NEXT:    v_or_b32_e32 v29, v54, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v40, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v42, v31
+; GCN-NEXT:    v_or_b32_e32 v28, v54, v28
+; GCN-NEXT:    v_or_b32_e32 v29, v40, v29
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -16851,10 +16877,8 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -16892,6 +16916,7 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -16904,42 +16929,44 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:  .LBB11_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB11_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v63
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v62
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v61
@@ -16981,19 +17008,15 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v44
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
@@ -17002,10 +17025,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
@@ -17014,10 +17037,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
@@ -17026,10 +17049,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
@@ -17038,10 +17061,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
@@ -17050,10 +17073,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
@@ -17062,10 +17085,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
@@ -17074,10 +17097,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
@@ -17086,10 +17109,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
@@ -17099,7 +17122,7 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
@@ -17108,10 +17131,8 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
@@ -17120,10 +17141,10 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
@@ -17132,51 +17153,56 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v41
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v41
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v55
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v55
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v53
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v49
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v50
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v50
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v39
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v39
-; GCN-NEXT:    v_mov_b32_e32 v39, v32
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v35
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v36
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v39
-; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v48
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v49
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v52
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v53
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v54
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v35
-; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v40
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v40
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
@@ -17192,8 +17218,8 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
 ; GCN-NEXT:    v_add_f32_e32 v32, 0x38000000, v32
+; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
 ; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
-; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
 ; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
@@ -17201,9 +17227,9 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
 ; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
-; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
-; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
 ; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
+; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
+; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
@@ -17220,18 +17246,18 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v30
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v31
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v36
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v48
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v49
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v50
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v51
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
@@ -17241,12 +17267,12 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_or_b32_e32 v19, v21, v20
 ; GCN-NEXT:    v_or_b32_e32 v20, v55, v39
@@ -17255,12 +17281,12 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v50
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v51
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v36
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v38
+; GCN-NEXT:    v_or_b32_e32 v26, v26, v35
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v36
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v33
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v34
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v35
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v37
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v37
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v38
 ; GCN-NEXT:  .LBB11_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Reload
@@ -17279,7 +17305,7 @@ define <32 x i32> @bitcast_v64f16_to_v32i32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: bitcast_v64f16_to_v32i32:
@@ -17536,10 +17562,10 @@ define <64 x i16> @bitcast_v32i32_to_v64i16(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -17650,7 +17676,6 @@ define <64 x i16> @bitcast_v32i32_to_v64i16(<32 x i32> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v31
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v32
@@ -18068,11 +18093,11 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mov_b32_e32 v37, v20
 ; GCN-NEXT:    v_mov_b32_e32 v38, v18
 ; GCN-NEXT:    v_mov_b32_e32 v39, v16
@@ -18084,127 +18109,128 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v53, v4
 ; GCN-NEXT:    v_mov_b32_e32 v54, v2
 ; GCN-NEXT:    v_mov_b32_e32 v55, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
+; GCN-NEXT:    s_waitcnt expcnt(6)
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(5)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(12) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v6
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v11
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v7
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -18213,132 +18239,132 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v55
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v54
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v36
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v59
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v58
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v53
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v57
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v52
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v35
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v56
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v51
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v60
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v50
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v49
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v48
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v39
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v38
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v43
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v46
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v45
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v46
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v45
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v32
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v34
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v42
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v41
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v40
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v63
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v62
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v47
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v33
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v44
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v44
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v43
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v42
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v41
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v63
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v62
+; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v61
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v60
+; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v32
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v33
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v35
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v34
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v47
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v19, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v21, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v22, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v27, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v32
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v59
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr53
@@ -18360,81 +18386,81 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:  .LBB13_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB13_4
@@ -18442,7 +18468,7 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v36, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v59, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v54
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v58, v1
@@ -18451,7 +18477,7 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v2, v57, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v52
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v35, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v56, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x30000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v51
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v50
@@ -18460,39 +18486,37 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v39
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v38
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v15
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v43
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v56
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v46
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v45
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v32
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v34
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v42
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v41
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v40
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v47
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v33
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v46
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v45
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v42
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v41
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v40
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v60
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v32
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v33
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v35
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v34
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -18521,86 +18545,88 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v4, v60, v4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v32, v8
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v32, v9
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v32, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v32, v11
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v32, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v32, v13
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v32, v15
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v32, v25
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v32, v27
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v32, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v59, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v36, v31
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x30000, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -18642,9 +18668,7 @@ define <32 x i32> @bitcast_v64i16_to_v32i32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:152 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:156 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:160 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(11)
 ; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:164 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(10)
 ; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:168 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:172 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:176 ; 4-byte Folded Reload
@@ -19697,12 +19721,12 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(5)
 ; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(4)
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr51
@@ -20130,7 +20154,6 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v27, 1.0, v27
 ; GCN-NEXT:    v_add_f32_e32 v30, 1.0, v30
 ; GCN-NEXT:    v_add_f32_e32 v29, 1.0, v29
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_f32_e32 v58, 1.0, v58
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v59, 1.0, v59
@@ -20919,8 +20942,8 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    ; kill: killed $vgpr39
@@ -21575,13 +21598,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 40, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v12, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 44, v0
@@ -21601,13 +21625,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v14, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 52, v0
@@ -21627,13 +21652,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 56, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v16, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 60, v0
@@ -21653,13 +21679,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 64, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v18, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x44, v0
@@ -21679,13 +21706,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x4c, v0
@@ -21705,13 +21733,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x50, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x54, v0
@@ -21731,13 +21760,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x58, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x5c, v0
@@ -21757,13 +21787,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x60, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x64, v0
@@ -21781,13 +21812,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x68, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x6c, v0
@@ -21807,13 +21839,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x70, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x74, v0
@@ -21879,8 +21912,8 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    ; implicit-def: $vgpr40
 ; GFX9-NEXT:    ; kill: killed $vgpr40
@@ -22070,7 +22103,7 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(30)
+; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    ; kill: killed $vgpr33
@@ -22079,7 +22112,6 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB18_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
-; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 24, v32
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
@@ -22275,7 +22307,6 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB18_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
-; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_add_f32_e32 v32, 1.0, v32
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
 ; GFX9-NEXT:    v_add_f32_e32 v31, 1.0, v31
@@ -22692,13 +22723,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:72
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:76
@@ -22718,13 +22750,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:80
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:84
@@ -22744,13 +22777,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:88
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:92
@@ -22770,13 +22804,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:96
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:100
@@ -22796,13 +22831,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:104
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:108
@@ -22822,13 +22858,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:112
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:116
@@ -22848,13 +22885,14 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:120
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v32, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:124
@@ -24133,154 +24171,147 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v55, v0
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:92
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:388
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:60
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:20
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:16
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v48, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:388
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:116
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v12
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 24, v10
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v5
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v20
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v20
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v18
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v7
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
@@ -24288,295 +24319,304 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:140
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:148
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:168
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:164
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:164
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:172
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:200
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:196
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:204
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:216
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:244
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:276
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:308
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:340
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:368
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:380
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:384
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:368
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:376
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 24, v0
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v1
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:372
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:76
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:12
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:372
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:364
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 24, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v4
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB19_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v39
+; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v55
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v38
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v56
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v48
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v3, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v38
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v56
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v47
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v54
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v54
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v47
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v46
@@ -24584,101 +24624,99 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v37
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v53
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v43
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v36
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v45
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
+; GCN-NEXT:    v_or_b32_e32 v8, v8, v37
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v53
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v45
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v42
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v40
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v52
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v63
@@ -24686,217 +24724,221 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v50
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v58
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v51
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v39
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v49
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v60
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v51
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v58
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v50
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v62
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v61
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v59
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v61
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v60
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v34
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v52
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v62
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v35
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v33
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v44
-; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v57
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v43
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v50, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v51, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v52, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v36, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v36, v37, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v37
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v37, v38, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v38, v39, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xff, v39
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v48, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v49, v39
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v48, v55, v48
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v48, v49, v48
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xff, v49
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v49, v54, v49
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
@@ -24927,6 +24969,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v39
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v48
 ; GCN-NEXT:    v_or_b32_e32 v31, v31, v49
+; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -24959,13 +25002,14 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -25050,37 +25094,34 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr42
@@ -25089,13 +25130,13 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -25110,19 +25151,20 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr59
@@ -25131,297 +25173,298 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; kill: killed $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; kill: killed $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:  .LBB19_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB19_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v0
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v39, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v38, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v56, v1
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v48, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v38, v2
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v56, v2
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v47, v3
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v54, v3
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v54, v4
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v47, v4
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v46, v5
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v37, v6
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v6, v53, v6
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v43, v7
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v48
+; GCN-NEXT:    v_or_b32_e32 v7, v36, v7
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v45, v8
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v37, v8
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v53, v9
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v45, v9
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v42, v10
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v41, v11
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v40, v12
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v52, v12
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v63, v13
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v50, v14
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v58, v14
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v0, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v0, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v0, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v51, v18
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v39, v18
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v49, v19
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v0, v19
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v60, v20
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v51, v20
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v58, v21
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v50, v21
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v25, v62, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v61, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v29, v59, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v37, v32, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v36, v32, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v50, v61, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v60, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v41, v34, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v45, v52, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v45, v62, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v56, v35, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v58, v33, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v59, v44, v22
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v57
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v57, v36, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v60, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v43, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v61, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v62, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v63, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v63, v0, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v38, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v37, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v38, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v49, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_mov_b32_e32 v0, v49
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v51, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v49, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v51, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v52, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v52, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v57
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v54, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v54, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v57, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
 ; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
@@ -25429,15 +25472,15 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v24
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v26, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v26
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v27, v26
 ; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
@@ -25445,151 +25488,150 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v28, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v30, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v31, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v35
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v48, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v48, vcc, 3, v48
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v48, v53, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, 3, v53
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_mov_b32_e32 v0, v55
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v55, v53
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v55
 ; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v55, v40, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v40
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v40, v42, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 3, v42
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xff, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v42, v43, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v43
 ; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v43, v44, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v44
 ; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v44, v46, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v0, v44
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v46, vcc, 3, v46
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xff, v46
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v46, v0, v46
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v47, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v47, v0, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v60, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v61, v0
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v61, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v62, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v62, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v63, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v63, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v22, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
@@ -25609,16 +25651,16 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, s7, v37
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v50
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v41
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v45
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v56
-; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v58
-; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v59
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, 0x300, v57
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, s7, v25
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v36
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, s7, v50
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v41
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v45
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v56
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v58
+; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v59
+; GCN-NEXT:    v_add_i32_e32 v59, vcc, 0x300, v60
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -25637,24 +25679,24 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v36
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff, v50
 ; GCN-NEXT:    v_and_b32_e32 v41, 0xffff, v41
 ; GCN-NEXT:    v_and_b32_e32 v45, 0xffff, v45
 ; GCN-NEXT:    v_and_b32_e32 v56, 0xffff, v56
 ; GCN-NEXT:    v_and_b32_e32 v58, 0xffff, v58
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xffff, v59
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff, v57
-; GCN-NEXT:    v_or_b32_e32 v4, v36, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v37, v4
 ; GCN-NEXT:    v_or_b32_e32 v5, v38, v5
 ; GCN-NEXT:    v_or_b32_e32 v6, v39, v6
 ; GCN-NEXT:    v_or_b32_e32 v7, v49, v7
 ; GCN-NEXT:    v_or_b32_e32 v8, v51, v8
 ; GCN-NEXT:    v_or_b32_e32 v9, v52, v9
 ; GCN-NEXT:    v_or_b32_e32 v10, v54, v10
-; GCN-NEXT:    v_or_b32_e32 v11, v22, v11
+; GCN-NEXT:    v_or_b32_e32 v11, v57, v11
 ; GCN-NEXT:    v_or_b32_e32 v12, v23, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v24, v13
 ; GCN-NEXT:    v_or_b32_e32 v14, v26, v14
@@ -25665,16 +25707,16 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
 ; GCN-NEXT:    v_or_b32_e32 v20, v33, v20
 ; GCN-NEXT:    v_or_b32_e32 v21, v34, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v35, v25
-; GCN-NEXT:    v_or_b32_e32 v23, v48, v29
-; GCN-NEXT:    v_or_b32_e32 v24, v53, v37
-; GCN-NEXT:    v_or_b32_e32 v25, v55, v50
-; GCN-NEXT:    v_or_b32_e32 v26, v40, v41
-; GCN-NEXT:    v_or_b32_e32 v27, v42, v45
-; GCN-NEXT:    v_or_b32_e32 v28, v43, v56
-; GCN-NEXT:    v_or_b32_e32 v29, v44, v58
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v59
-; GCN-NEXT:    v_or_b32_e32 v31, v47, v57
+; GCN-NEXT:    v_or_b32_e32 v22, v35, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v48, v25
+; GCN-NEXT:    v_or_b32_e32 v24, v53, v29
+; GCN-NEXT:    v_or_b32_e32 v25, v55, v36
+; GCN-NEXT:    v_or_b32_e32 v26, v40, v50
+; GCN-NEXT:    v_or_b32_e32 v27, v42, v41
+; GCN-NEXT:    v_or_b32_e32 v28, v43, v45
+; GCN-NEXT:    v_or_b32_e32 v29, v44, v56
+; GCN-NEXT:    v_or_b32_e32 v30, v46, v58
+; GCN-NEXT:    v_or_b32_e32 v31, v47, v59
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -25780,19 +25822,18 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -25824,59 +25865,61 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -25885,25 +25928,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -25911,25 +25954,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -25937,25 +25980,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -25963,10 +26006,10 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -25978,10 +26021,10 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -25992,22 +26035,22 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -26018,14 +26061,14 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -26037,34 +26080,33 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB19_2
 ; VI-NEXT:  ; %bb.1: ; %cmp.false
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -26072,19 +26114,19 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -26095,23 +26137,23 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr55
+; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr38
@@ -26152,147 +26194,217 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; kill: killed $vgpr33
 ; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr54
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -26302,16 +26414,16 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -26320,7 +26432,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -26329,253 +26441,190 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr33
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:  .LBB19_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB19_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v31, 0x300
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v61
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_add_u16_e32 v8, 3, v8
+; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v1, 3, v1
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v1
+; VI-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; VI-NEXT:    v_add_u16_sdwa v1, v1, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -26584,31 +26633,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v1, v2, v3
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(13)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(11)
+; VI-NEXT:    s_waitcnt vmcnt(10)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -26645,13 +26688,13 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v63
+; VI-NEXT:    v_add_u16_e32 v8, 3, v62
 ; VI-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v59
+; VI-NEXT:    v_add_u16_e32 v9, 3, v32
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v62
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
@@ -26659,27 +26702,28 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
 ; VI-NEXT:    v_add_u16_e32 v10, 3, v58
 ; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v60
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v57
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v46
+; VI-NEXT:    v_add_u16_e32 v12, 3, v47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -26690,148 +26734,149 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v54
+; VI-NEXT:    v_add_u16_e32 v14, 3, v42
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v40
-; VI-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v15
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v15, v15, v16
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v17, v17, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v16, v16, v17
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v18, v18, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v17, v17, v18
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v19, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v18, v18, v19
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v20, v20, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v19, v19, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v21, v21, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v20, v20, v21
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v22, v22, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v21, v21, v22
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v23, v23, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v23
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v24, v24, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v23, v23, v24
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
 ; VI-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v25, v25, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v24, v24, v25
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -26845,19 +26890,19 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v26, v26, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v25, v25, v26
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v26, 3, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
 ; VI-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v27, v27, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v27
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -26871,20 +26916,20 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v28, v28, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v28
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v28, 3, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v29, v29, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v29
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -26897,7 +26942,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v30
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v30, 3, v30
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -26910,7 +26955,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v30, v30, v32
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v32, 3, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -26919,7 +26964,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v32, 0x300, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v33, 3, v33
-; VI-NEXT:    v_or_b32_sdwa v33, v34, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v33, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v31, v33, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v31, v32, v31
 ; VI-NEXT:  .LBB19_4: ; %end
@@ -26996,19 +27041,18 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -27020,93 +27064,95 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v29
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
@@ -27116,25 +27162,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -27143,25 +27189,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -27170,25 +27216,25 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -27197,10 +27243,10 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -27212,10 +27258,10 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -27227,22 +27273,22 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -27254,14 +27300,14 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -27273,34 +27319,33 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB19_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -27308,20 +27353,20 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -27332,23 +27377,23 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr38
@@ -27389,147 +27434,217 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; kill: killed $vgpr33
 ; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -27539,16 +27654,16 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -27557,7 +27672,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -27566,258 +27681,196 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr33
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:  .LBB19_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB19_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(13)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(33)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
+; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(23)
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_add_u16_sdwa v1, v1, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -27826,32 +27879,26 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_e32 v1, v2, v3
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v4, v4, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v5, 3, v5
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(12)
+; GFX9-NEXT:    s_waitcnt vmcnt(11)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -27888,41 +27935,41 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v32
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v57
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -27933,148 +27980,149 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v54
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v40
-; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v17, v17, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v16, v16, v17
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v18, v18, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v17, v17, v18
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v19, v19, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v18, v18, v19
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v20, v20, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v19, v19, v20
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v21, v21, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v20, v20, v21
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v22, v22, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v21, v21, v22
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v23, v23, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v22, v22, v23
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v24, v24, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v23, v23, v24
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v25, v25, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v24, v24, v25
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -28088,19 +28136,19 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v26, v26, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v25, v25, v26
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v26, 3, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v27, v27, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v26, v26, v27
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -28114,20 +28162,20 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v28, v28, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v27, v27, v28
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v28, 3, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v29, v29, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v28, v28, v29
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -28140,7 +28188,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v30, v30, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v29, v29, v30
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v30, 3, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -28153,7 +28201,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v31, v31, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v30, v30, v31
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -28162,7 +28210,7 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v32, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v32, v33, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v32, v63, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v32, v32, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v31, v31, v32
 ; GFX9-NEXT:  .LBB19_4: ; %end
@@ -30202,11 +30250,11 @@ define <64 x bfloat> @bitcast_v32f32_to_v64bf16(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v32
 ; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -30309,7 +30357,7 @@ define <64 x bfloat> @bitcast_v32f32_to_v64bf16(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB20_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1) expcnt(0)
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v62
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v62
 ; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
@@ -31059,29 +31107,27 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:72
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32
 ; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v1
 ; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v0
 ; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v3
@@ -31092,68 +31138,62 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v46, 1.0, v9
 ; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v8
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v11
 ; GCN-NEXT:    v_mul_f32_e32 v45, 1.0, v10
-; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -31163,69 +31203,79 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
 ; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v33
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v5
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v42
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v44
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v55
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v40
+; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v51
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v52
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v48
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v50
-; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v50
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v38
+; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v44
+; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v63
 ; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
-; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v63
+; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v42
 ; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
-; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v43
+; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v55
 ; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
-; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v41
+; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v54
 ; GCN-NEXT:    v_mul_f32_e32 v49, 1.0, v49
-; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v54
+; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v31
-; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v54, 1.0, v0
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v1
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v2
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v0
+; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -31242,91 +31292,87 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v3, v3, v57, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v46
 ; GCN-NEXT:    v_alignbit_b32 v4, v4, v47, 16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v43
 ; GCN-NEXT:    v_alignbit_b32 v5, v5, v45, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v6, v6, v7, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v6, v6, v41, 16
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v7, v7, v8, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v8, v8, v9, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v9, v9, v10, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v10, v10, v11, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v11, v11, v12, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v12, v12, v13, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v13, v13, v14, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v14, v14, v15, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v34
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v15, v15, v16, 16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v16, v16, v33, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v17, v17, v18, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
 ; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v36
@@ -31334,30 +31380,34 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v39
 ; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v49
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v52
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v54
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v41
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v43
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v54
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v55
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v42
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v44
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v18, v18, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v19, v19, v42, 16
-; GCN-NEXT:    v_alignbit_b32 v20, v20, v44, 16
+; GCN-NEXT:    v_alignbit_b32 v19, v19, v40, 16
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v20, v20, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v21, v21, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v22, v22, v48, 16
-; GCN-NEXT:    v_alignbit_b32 v23, v23, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v22, v22, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v23, v23, v48, 16
 ; GCN-NEXT:    v_alignbit_b32 v24, v24, v50, 16
 ; GCN-NEXT:    v_alignbit_b32 v25, v25, v51, 16
-; GCN-NEXT:    v_alignbit_b32 v26, v26, v53, 16
-; GCN-NEXT:    v_alignbit_b32 v27, v27, v55, 16
-; GCN-NEXT:    v_alignbit_b32 v28, v28, v40, 16
-; GCN-NEXT:    v_alignbit_b32 v29, v29, v63, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v26, v26, v52, 16
+; GCN-NEXT:    v_alignbit_b32 v27, v27, v53, 16
+; GCN-NEXT:    v_alignbit_b32 v28, v28, v63, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v29, v29, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v30, v30, v32, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v31, v31, v32, 16
 ; GCN-NEXT:    ; implicit-def: $vgpr62
@@ -31370,13 +31420,11 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -31425,10 +31473,11 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
@@ -31436,26 +31485,27 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; kill: killed $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; kill: killed $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:  .LBB21_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB21_4
@@ -31492,104 +31542,100 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_alignbit_b32 v4, v5, v4, 16
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v43
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_alignbit_b32 v5, v6, v5, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v41
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_alignbit_b32 v6, v7, v6, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_alignbit_b32 v7, v8, v7, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_alignbit_b32 v8, v9, v8, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_alignbit_b32 v9, v10, v9, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_alignbit_b32 v10, v11, v10, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_alignbit_b32 v11, v12, v11, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_alignbit_b32 v12, v13, v12, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_alignbit_b32 v13, v14, v13, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_alignbit_b32 v14, v15, v14, 16
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v34
@@ -31598,69 +31644,73 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v15, v16, v15, 16
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v33
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_alignbit_b32 v16, v17, v16, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_alignbit_b32 v17, v18, v17, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v42
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v40
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v44
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v48
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v38
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v38
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v48
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v36
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v50
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v35
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v51
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff0000, v37
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v53
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v52
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v39
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v55
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v53
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v49
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v40
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v63
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v54
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v54
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v48
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v55
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v50
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v41
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v42
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v43
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v44
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v20
@@ -33353,43 +33403,42 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v37, v18, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v20, 0xffff, v20, v32
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v37, 0x40c00000, v38 :: v_dual_cndmask_b32 v34, v34, v36
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v36, 0x400000, v18
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v38, 16, v16
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v48, 0x400000, v37
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v19, 0xffff, v19, v33
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v18, v35, v36, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v38
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v38, v17, 16, 1
+; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v35, v37, 16, 1
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v39, v36, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v38, v17, 0x7fff
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v51, 0x400000, v36
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v35, v37, 0x7fff
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v39, v39, v36, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v17, v38, v49, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v16
-; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v18.l, v18.h
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v17.l, v17.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v36, v39, v51, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v18, 0xffff, v18, v34
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v36.l, v36.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v35, v35, v48, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v17, 0xffff, v17, v35
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v16, v38, v49, vcc_lo
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v16, 0xffff, v36, v16
 ; GFX11-TRUE16-NEXT:  .LBB21_2: ; %end
 ; GFX11-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s0
@@ -33550,15 +33599,15 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v34, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v37, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v33, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v34, 0x40c00000, v34
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v34, 0x40c00000, v34 :: v_dual_add_f32 v35, 0x40c00000, v37
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v35, 0x40c00000, v37
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v38, v34, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v32, v35, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v36, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
@@ -33907,17 +33956,16 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v39, 16, v16
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v18, v36, v37, vcc_lo
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v39
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v38, v35, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v38, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v39, v17, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v48, v36, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v49, 0x400000, v36
-; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v18, v18, v34, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v35, v37, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v38, v39, v17, 0x7fff
@@ -33925,7 +33973,7 @@ define <32 x float> @bitcast_v64bf16_to_v32f32(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v48, v48, v36, 0x7fff
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v38, v39, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v37, v16, 0x7fff
@@ -33977,12 +34025,12 @@ define <64 x half> @bitcast_v32f32_to_v64f16(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -34086,7 +34134,6 @@ define <64 x half> @bitcast_v32f32_to_v64f16(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB22_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v62
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:132 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
@@ -34902,28 +34949,26 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v0
@@ -34936,67 +34981,61 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v9
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v10
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v49
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -35006,28 +35045,27 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v48
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v4
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v43
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v41
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v41
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v55
@@ -35035,46 +35073,58 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v53
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v54
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v54
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v50
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v52
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v48
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v39
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v49
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v35
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v37
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v33
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v31
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v0
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v3
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v7
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB23_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v61
 ; GCN-NEXT:    v_or_b32_e32 v0, v62, v0
@@ -35086,123 +35136,123 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v47
 ; GCN-NEXT:    v_or_b32_e32 v4, v46, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v5, v44, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v44
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v38
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v32
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v32
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v38
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v43, v19
-; GCN-NEXT:    v_or_b32_e32 v20, v41, v20
-; GCN-NEXT:    v_or_b32_e32 v21, v55, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v49, v22
-; GCN-NEXT:    v_or_b32_e32 v23, v50, v23
-; GCN-NEXT:    v_or_b32_e32 v24, v39, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v41, v19
+; GCN-NEXT:    v_or_b32_e32 v20, v55, v20
+; GCN-NEXT:    v_or_b32_e32 v21, v53, v21
+; GCN-NEXT:    v_or_b32_e32 v22, v50, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v39, v23
+; GCN-NEXT:    v_or_b32_e32 v24, v35, v24
 ; GCN-NEXT:    v_or_b32_e32 v25, v36, v25
-; GCN-NEXT:    v_or_b32_e32 v26, v48, v26
+; GCN-NEXT:    v_or_b32_e32 v26, v49, v26
 ; GCN-NEXT:    v_or_b32_e32 v27, v52, v27
-; GCN-NEXT:    v_or_b32_e32 v28, v53, v28
-; GCN-NEXT:    v_or_b32_e32 v29, v54, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v40, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v42, v31
+; GCN-NEXT:    v_or_b32_e32 v28, v54, v28
+; GCN-NEXT:    v_or_b32_e32 v29, v40, v29
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -35214,10 +35264,8 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -35255,6 +35303,7 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -35267,42 +35316,44 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:  .LBB23_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB23_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v63
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v62
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v61
@@ -35344,19 +35395,15 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v44
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
@@ -35365,10 +35412,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
@@ -35377,10 +35424,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
@@ -35389,10 +35436,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
@@ -35401,10 +35448,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
@@ -35413,10 +35460,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
@@ -35425,10 +35472,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
@@ -35437,10 +35484,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
@@ -35449,10 +35496,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
@@ -35462,7 +35509,7 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
@@ -35471,10 +35518,8 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
@@ -35483,10 +35528,10 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
@@ -35495,51 +35540,56 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v41
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v41
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v55
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v55
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v53
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v49
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v50
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v50
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v39
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v39
-; GCN-NEXT:    v_mov_b32_e32 v39, v32
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v35
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v36
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v39
-; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v48
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v49
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v52
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v53
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v54
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v35
-; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v40
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v40
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
@@ -35555,8 +35605,8 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
 ; GCN-NEXT:    v_add_f32_e32 v32, 0x38000000, v32
+; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
 ; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
-; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
 ; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
@@ -35564,9 +35614,9 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
 ; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
-; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
-; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
 ; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
+; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
+; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
@@ -35583,18 +35633,18 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v30
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v31
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v36
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v48
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v49
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v50
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v51
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
@@ -35604,12 +35654,12 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_or_b32_e32 v19, v21, v20
 ; GCN-NEXT:    v_or_b32_e32 v20, v55, v39
@@ -35618,12 +35668,12 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v50
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v51
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v36
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v38
+; GCN-NEXT:    v_or_b32_e32 v26, v26, v35
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v36
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v33
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v34
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v35
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v37
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v37
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v38
 ; GCN-NEXT:  .LBB23_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Reload
@@ -35642,7 +35692,7 @@ define <32 x float> @bitcast_v64f16_to_v32f32(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: bitcast_v64f16_to_v32f32:
@@ -35899,10 +35949,10 @@ define <64 x i16> @bitcast_v32f32_to_v64i16(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -36013,7 +36063,6 @@ define <64 x i16> @bitcast_v32f32_to_v64i16(<32 x float> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v27, 1.0, v27
 ; GCN-NEXT:    v_add_f32_e32 v30, 1.0, v30
 ; GCN-NEXT:    v_add_f32_e32 v29, 1.0, v29
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_f32_e32 v31, 1.0, v31
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v32, 1.0, v32
@@ -36415,11 +36464,11 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mov_b32_e32 v37, v20
 ; GCN-NEXT:    v_mov_b32_e32 v38, v18
 ; GCN-NEXT:    v_mov_b32_e32 v39, v16
@@ -36431,127 +36480,128 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v53, v4
 ; GCN-NEXT:    v_mov_b32_e32 v54, v2
 ; GCN-NEXT:    v_mov_b32_e32 v55, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
+; GCN-NEXT:    s_waitcnt expcnt(6)
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(5)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(12) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v6
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v11
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v7
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -36560,132 +36610,132 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v55
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v54
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v36
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v59
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v58
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v53
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v57
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v52
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v35
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v56
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v51
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v60
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v50
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v49
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v48
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v39
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v38
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v43
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v46
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v45
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v46
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v45
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v32
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v34
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v42
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v41
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v40
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v63
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v62
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v47
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v33
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v44
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v44
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v43
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v42
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v41
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v63
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v62
+; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v61
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v60
+; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v32
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v33
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v35
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v34
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v47
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v19, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v21, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v22, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v27, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v32
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v59
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr53
@@ -36707,81 +36757,81 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:  .LBB25_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB25_4
@@ -36789,7 +36839,7 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v36, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v59, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v54
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v58, v1
@@ -36798,7 +36848,7 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v2, v57, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v52
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v35, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v56, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x30000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v51
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v50
@@ -36807,39 +36857,37 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v39
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v38
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v15
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v43
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v56
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v46
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v45
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v32
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v34
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v42
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v41
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v40
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v47
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v33
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v46
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v45
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v42
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v41
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v40
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v60
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v32
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v33
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v35
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v34
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -36868,86 +36916,88 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v4, v60, v4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v32, v8
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v32, v9
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v32, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v32, v11
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v32, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v32, v13
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v32, v15
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v32, v25
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v32, v27
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v32, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v59, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v36, v31
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x30000, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -36989,9 +37039,7 @@ define <32 x float> @bitcast_v64i16_to_v32f32(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:152 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:156 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:160 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(11)
 ; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:164 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(10)
 ; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:168 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:172 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:176 ; 4-byte Folded Reload
@@ -37638,12 +37686,11 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
-; GCN-NEXT:    s_waitcnt expcnt(6)
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    s_waitcnt expcnt(4)
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr51
@@ -37684,7 +37731,7 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -37811,13 +37858,13 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    s_cbranch_execz .LBB28_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v31, v57, v59, 24
+; GCN-NEXT:    v_alignbit_b32 v31, v56, v59, 24
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v31, v57, v59, 16
+; GCN-NEXT:    v_alignbit_b32 v31, v56, v59, 16
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v31, v57, v59, 8
+; GCN-NEXT:    v_alignbit_b32 v31, v56, v59, 8
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v31, v30, v29, 24
@@ -37931,12 +37978,12 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v51, v2, v1, 16
 ; GCN-NEXT:    v_alignbit_b32 v52, v2, v1, 8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v56
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v56
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v53, 8, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v53, 8, v56
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v30
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
@@ -38013,7 +38060,7 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v10
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v56, 8, v10
+; GCN-NEXT:    v_lshrrev_b32_e32 v57, 8, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v8
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
@@ -38073,14 +38120,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    v_addc_u32_e32 v30, vcc, 0, v30, vcc
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, 3, v59
-; GCN-NEXT:    v_addc_u32_e32 v57, vcc, 0, v57, vcc
-; GCN-NEXT:    v_alignbit_b32 v31, v57, v59, 24
+; GCN-NEXT:    v_addc_u32_e32 v56, vcc, 0, v56, vcc
+; GCN-NEXT:    v_alignbit_b32 v31, v56, v59, 24
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v31, v57, v59, 16
+; GCN-NEXT:    v_alignbit_b32 v31, v56, v59, 16
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v31, v57, v59, 8
+; GCN-NEXT:    v_alignbit_b32 v31, v56, v59, 8
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v31, v30, v29, 24
@@ -38194,12 +38241,12 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v51, v2, v1, 16
 ; GCN-NEXT:    v_alignbit_b32 v52, v2, v1, 8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v56
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v56
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v53, 8, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v53, 8, v56
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v30
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
@@ -38276,7 +38323,7 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v10
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v56, 8, v10
+; GCN-NEXT:    v_lshrrev_b32_e32 v57, 8, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 24, v8
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
@@ -38351,7 +38398,7 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v10
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v57
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
@@ -38430,7 +38477,7 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v27, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v57
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v56
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v53
 ; GCN-NEXT:    v_or_b32_e32 v28, v1, v3
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
@@ -38859,8 +38906,8 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    ; kill: killed $vgpr39
@@ -39515,13 +39562,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 40, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v12, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 44, v0
@@ -39541,13 +39589,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v14, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 52, v0
@@ -39567,13 +39616,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 56, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v16, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 60, v0
@@ -39593,13 +39643,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 64, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v18, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x44, v0
@@ -39619,13 +39670,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x4c, v0
@@ -39645,13 +39697,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x50, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x54, v0
@@ -39671,13 +39724,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x58, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x5c, v0
@@ -39697,13 +39751,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x60, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x64, v0
@@ -39721,13 +39776,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x68, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x6c, v0
@@ -39747,13 +39803,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x70, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x74, v0
@@ -39819,8 +39876,8 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    ; implicit-def: $vgpr40
 ; GFX9-NEXT:    ; kill: killed $vgpr40
@@ -40010,7 +40067,7 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(30)
+; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    ; kill: killed $vgpr33
@@ -40019,7 +40076,6 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB28_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
-; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 24, v32
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
@@ -40631,13 +40687,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:72
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:76
@@ -40657,13 +40714,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:80
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:84
@@ -40683,13 +40741,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:88
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:92
@@ -40709,13 +40768,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:96
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:100
@@ -40735,13 +40795,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:104
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:108
@@ -40761,13 +40822,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:112
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:116
@@ -40787,13 +40849,14 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:120
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v32, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:124
@@ -42122,154 +42185,147 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v55, v0
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:92
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:388
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:60
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:20
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:16
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v48, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:388
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:116
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v12
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 24, v10
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v5
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v20
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v20
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v18
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v7
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
@@ -42277,295 +42333,304 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:140
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:148
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:168
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:164
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:164
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:172
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:200
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:196
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:204
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:216
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:244
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:276
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:308
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:340
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:368
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:380
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:384
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:368
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:376
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 24, v0
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v1
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:372
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:76
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:12
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:372
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:364
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 24, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v4
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB29_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v39
+; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v55
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v38
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v56
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v48
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v3, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v38
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v56
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v47
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v54
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v54
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v47
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v46
@@ -42573,101 +42638,99 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v37
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v53
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v43
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v36
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v45
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
+; GCN-NEXT:    v_or_b32_e32 v8, v8, v37
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v53
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v45
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v42
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v40
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v52
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v63
@@ -42675,217 +42738,221 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v50
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v58
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v51
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v39
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v49
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v60
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v51
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v58
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v50
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v62
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v61
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v59
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v61
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v60
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v34
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v52
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v62
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v35
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v33
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v44
-; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v57
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v43
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v50, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v51, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v52, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v36, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v36, v37, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v37
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v37, v38, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v38, v39, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xff, v39
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v48, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v49, v39
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v48, v55, v48
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v48, v49, v48
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xff, v49
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v49, v54, v49
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
@@ -42916,6 +42983,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v39
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v48
 ; GCN-NEXT:    v_or_b32_e32 v31, v31, v49
+; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -42948,13 +43016,14 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -43039,37 +43108,34 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr42
@@ -43078,13 +43144,13 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -43099,19 +43165,20 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr59
@@ -43120,297 +43187,298 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; kill: killed $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; kill: killed $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:  .LBB29_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB29_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v0
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v39, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v38, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v56, v1
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v48, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v38, v2
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v56, v2
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v47, v3
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v54, v3
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v54, v4
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v47, v4
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v46, v5
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v37, v6
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v6, v53, v6
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v43, v7
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v48
+; GCN-NEXT:    v_or_b32_e32 v7, v36, v7
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v45, v8
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v37, v8
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v53, v9
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v45, v9
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v42, v10
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v41, v11
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v40, v12
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v52, v12
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v63, v13
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v50, v14
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v58, v14
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v0, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v0, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v0, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v51, v18
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v39, v18
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v49, v19
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v0, v19
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v60, v20
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v51, v20
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v58, v21
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v50, v21
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v25, v62, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v61, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v29, v59, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v37, v32, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v36, v32, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v50, v61, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v60, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v41, v34, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v45, v52, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v45, v62, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v56, v35, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v58, v33, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v59, v44, v22
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v57
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v57, v36, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v60, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v43, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v61, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v62, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v63, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v63, v0, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v38, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v37, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v38, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v49, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_mov_b32_e32 v0, v49
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v51, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v49, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v51, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v52, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v52, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v57
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v54, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v54, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v57, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
 ; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
@@ -43418,15 +43486,15 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v24
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v26, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v26
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v27, v26
 ; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
@@ -43434,151 +43502,150 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v28, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v30, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v31, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v35
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v48, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v48, vcc, 3, v48
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v48, v53, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, 3, v53
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_mov_b32_e32 v0, v55
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v55, v53
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v55
 ; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v55, v40, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v40
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v40, v42, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 3, v42
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xff, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v42, v43, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v43
 ; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v43, v44, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v44
 ; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v44, v46, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v0, v44
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v46, vcc, 3, v46
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xff, v46
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v46, v0, v46
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v47, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v47, v0, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v60, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v61, v0
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v61, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v62, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v62, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v63, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v63, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v22, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
@@ -43598,16 +43665,16 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, s7, v37
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v50
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v41
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v45
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v56
-; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v58
-; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v59
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, 0x300, v57
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, s7, v25
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v36
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, s7, v50
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v41
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v45
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v56
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v58
+; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v59
+; GCN-NEXT:    v_add_i32_e32 v59, vcc, 0x300, v60
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -43626,24 +43693,24 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v36
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff, v50
 ; GCN-NEXT:    v_and_b32_e32 v41, 0xffff, v41
 ; GCN-NEXT:    v_and_b32_e32 v45, 0xffff, v45
 ; GCN-NEXT:    v_and_b32_e32 v56, 0xffff, v56
 ; GCN-NEXT:    v_and_b32_e32 v58, 0xffff, v58
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xffff, v59
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff, v57
-; GCN-NEXT:    v_or_b32_e32 v4, v36, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v37, v4
 ; GCN-NEXT:    v_or_b32_e32 v5, v38, v5
 ; GCN-NEXT:    v_or_b32_e32 v6, v39, v6
 ; GCN-NEXT:    v_or_b32_e32 v7, v49, v7
 ; GCN-NEXT:    v_or_b32_e32 v8, v51, v8
 ; GCN-NEXT:    v_or_b32_e32 v9, v52, v9
 ; GCN-NEXT:    v_or_b32_e32 v10, v54, v10
-; GCN-NEXT:    v_or_b32_e32 v11, v22, v11
+; GCN-NEXT:    v_or_b32_e32 v11, v57, v11
 ; GCN-NEXT:    v_or_b32_e32 v12, v23, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v24, v13
 ; GCN-NEXT:    v_or_b32_e32 v14, v26, v14
@@ -43654,16 +43721,16 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
 ; GCN-NEXT:    v_or_b32_e32 v20, v33, v20
 ; GCN-NEXT:    v_or_b32_e32 v21, v34, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v35, v25
-; GCN-NEXT:    v_or_b32_e32 v23, v48, v29
-; GCN-NEXT:    v_or_b32_e32 v24, v53, v37
-; GCN-NEXT:    v_or_b32_e32 v25, v55, v50
-; GCN-NEXT:    v_or_b32_e32 v26, v40, v41
-; GCN-NEXT:    v_or_b32_e32 v27, v42, v45
-; GCN-NEXT:    v_or_b32_e32 v28, v43, v56
-; GCN-NEXT:    v_or_b32_e32 v29, v44, v58
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v59
-; GCN-NEXT:    v_or_b32_e32 v31, v47, v57
+; GCN-NEXT:    v_or_b32_e32 v22, v35, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v48, v25
+; GCN-NEXT:    v_or_b32_e32 v24, v53, v29
+; GCN-NEXT:    v_or_b32_e32 v25, v55, v36
+; GCN-NEXT:    v_or_b32_e32 v26, v40, v50
+; GCN-NEXT:    v_or_b32_e32 v27, v42, v41
+; GCN-NEXT:    v_or_b32_e32 v28, v43, v45
+; GCN-NEXT:    v_or_b32_e32 v29, v44, v56
+; GCN-NEXT:    v_or_b32_e32 v30, v46, v58
+; GCN-NEXT:    v_or_b32_e32 v31, v47, v59
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -43769,19 +43836,18 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -43813,59 +43879,61 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -43874,25 +43942,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -43900,25 +43968,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -43926,25 +43994,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -43952,10 +44020,10 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -43967,10 +44035,10 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -43981,22 +44049,22 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -44007,14 +44075,14 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -44026,34 +44094,33 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB29_2
 ; VI-NEXT:  ; %bb.1: ; %cmp.false
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -44061,19 +44128,19 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -44084,23 +44151,23 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr55
+; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr38
@@ -44141,147 +44208,217 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; kill: killed $vgpr33
 ; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr54
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -44291,16 +44428,16 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -44309,7 +44446,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -44318,253 +44455,190 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr33
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:  .LBB29_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB29_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v31, 0x300
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v61
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_add_u16_e32 v8, 3, v8
+; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v1, 3, v1
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v1
+; VI-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; VI-NEXT:    v_add_u16_sdwa v1, v1, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -44573,31 +44647,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v1, v2, v3
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(13)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(11)
+; VI-NEXT:    s_waitcnt vmcnt(10)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -44634,13 +44702,13 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v63
+; VI-NEXT:    v_add_u16_e32 v8, 3, v62
 ; VI-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v59
+; VI-NEXT:    v_add_u16_e32 v9, 3, v32
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v62
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
@@ -44648,27 +44716,28 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
 ; VI-NEXT:    v_add_u16_e32 v10, 3, v58
 ; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v60
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v57
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v46
+; VI-NEXT:    v_add_u16_e32 v12, 3, v47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -44679,148 +44748,149 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v54
+; VI-NEXT:    v_add_u16_e32 v14, 3, v42
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v40
-; VI-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v15
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v15, v15, v16
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v17, v17, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v16, v16, v17
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v18, v18, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v17, v17, v18
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v19, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v18, v18, v19
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v20, v20, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v19, v19, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v21, v21, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v20, v20, v21
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v22, v22, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v21, v21, v22
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v23, v23, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v23
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v24, v24, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v23, v23, v24
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
 ; VI-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v25, v25, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v24, v24, v25
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -44834,19 +44904,19 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v26, v26, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v25, v25, v26
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v26, 3, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
 ; VI-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v27, v27, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v27
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -44860,20 +44930,20 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v28, v28, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v28
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v28, 3, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v29, v29, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v29
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -44886,7 +44956,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v30
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v30, 3, v30
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -44899,7 +44969,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v30, v30, v32
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v32, 3, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -44908,7 +44978,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v32, 0x300, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v33, 3, v33
-; VI-NEXT:    v_or_b32_sdwa v33, v34, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v33, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v31, v33, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v31, v32, v31
 ; VI-NEXT:  .LBB29_4: ; %end
@@ -44985,19 +45055,18 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -45009,93 +45078,95 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v29
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
@@ -45105,25 +45176,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -45132,25 +45203,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -45159,25 +45230,25 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -45186,10 +45257,10 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -45201,10 +45272,10 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -45216,22 +45287,22 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -45243,14 +45314,14 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -45262,34 +45333,33 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB29_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -45297,20 +45367,20 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -45321,23 +45391,23 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr38
@@ -45378,147 +45448,217 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; kill: killed $vgpr33
 ; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -45528,16 +45668,16 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -45546,7 +45686,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -45555,258 +45695,196 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr33
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:  .LBB29_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB29_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(13)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(33)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
+; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(23)
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_add_u16_sdwa v1, v1, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -45815,32 +45893,26 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_e32 v1, v2, v3
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v4, v4, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v5, 3, v5
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(12)
+; GFX9-NEXT:    s_waitcnt vmcnt(11)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -45877,41 +45949,41 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v32
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v57
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -45922,148 +45994,149 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v54
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v40
-; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v17, v17, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v16, v16, v17
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v18, v18, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v17, v17, v18
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v19, v19, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v18, v18, v19
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v20, v20, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v19, v19, v20
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v21, v21, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v20, v20, v21
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v22, v22, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v21, v21, v22
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v23, v23, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v22, v22, v23
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v24, v24, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v23, v23, v24
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v25, v25, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v24, v24, v25
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -46077,19 +46150,19 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v26, v26, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v25, v25, v26
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v26, 3, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v27, v27, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v26, v26, v27
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -46103,20 +46176,20 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v28, v28, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v27, v27, v28
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v28, 3, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v29, v29, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v28, v28, v29
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -46129,7 +46202,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v30, v30, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v29, v29, v30
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v30, 3, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -46142,7 +46215,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v31, v31, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v30, v30, v31
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -46151,7 +46224,7 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v32, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v32, v33, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v32, v63, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v32, v32, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v31, v31, v32
 ; GFX9-NEXT:  .LBB29_4: ; %end
@@ -48191,11 +48264,11 @@ define <64 x bfloat> @bitcast_v16i64_to_v64bf16(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v32
 ; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -48298,7 +48371,7 @@ define <64 x bfloat> @bitcast_v16i64_to_v64bf16(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB30_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1) expcnt(0)
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v62
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v62
 ; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
@@ -49072,29 +49145,27 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:72
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32
 ; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v1
 ; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v0
 ; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v3
@@ -49105,68 +49176,62 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v46, 1.0, v9
 ; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v8
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v11
 ; GCN-NEXT:    v_mul_f32_e32 v45, 1.0, v10
-; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -49176,69 +49241,79 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
 ; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v33
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v5
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v42
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v44
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v55
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v40
+; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v51
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v52
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v48
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v50
-; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v50
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v38
+; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v44
+; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v63
 ; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
-; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v63
+; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v42
 ; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
-; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v43
+; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v55
 ; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
-; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v41
+; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v54
 ; GCN-NEXT:    v_mul_f32_e32 v49, 1.0, v49
-; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v54
+; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v31
-; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v54, 1.0, v0
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v1
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v2
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v0
+; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -49255,91 +49330,87 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v3, v3, v57, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v46
 ; GCN-NEXT:    v_alignbit_b32 v4, v4, v47, 16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v43
 ; GCN-NEXT:    v_alignbit_b32 v5, v5, v45, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v6, v6, v7, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v6, v6, v41, 16
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v7, v7, v8, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v8, v8, v9, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v9, v9, v10, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v10, v10, v11, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v11, v11, v12, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v12, v12, v13, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v13, v13, v14, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v14, v14, v15, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v34
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v15, v15, v16, 16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v16, v16, v33, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v17, v17, v18, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
 ; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v36
@@ -49347,30 +49418,34 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v39
 ; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v49
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v52
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v54
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v41
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v43
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v54
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v55
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v42
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v44
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v18, v18, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v19, v19, v42, 16
-; GCN-NEXT:    v_alignbit_b32 v20, v20, v44, 16
+; GCN-NEXT:    v_alignbit_b32 v19, v19, v40, 16
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v20, v20, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v21, v21, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v22, v22, v48, 16
-; GCN-NEXT:    v_alignbit_b32 v23, v23, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v22, v22, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v23, v23, v48, 16
 ; GCN-NEXT:    v_alignbit_b32 v24, v24, v50, 16
 ; GCN-NEXT:    v_alignbit_b32 v25, v25, v51, 16
-; GCN-NEXT:    v_alignbit_b32 v26, v26, v53, 16
-; GCN-NEXT:    v_alignbit_b32 v27, v27, v55, 16
-; GCN-NEXT:    v_alignbit_b32 v28, v28, v40, 16
-; GCN-NEXT:    v_alignbit_b32 v29, v29, v63, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v26, v26, v52, 16
+; GCN-NEXT:    v_alignbit_b32 v27, v27, v53, 16
+; GCN-NEXT:    v_alignbit_b32 v28, v28, v63, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v29, v29, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v30, v30, v32, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v31, v31, v32, 16
 ; GCN-NEXT:    ; implicit-def: $vgpr62
@@ -49383,13 +49458,11 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -49438,10 +49511,11 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
@@ -49449,26 +49523,27 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; kill: killed $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; kill: killed $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:  .LBB31_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB31_4
@@ -49505,104 +49580,100 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_alignbit_b32 v4, v5, v4, 16
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v43
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_alignbit_b32 v5, v6, v5, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v41
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_alignbit_b32 v6, v7, v6, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_alignbit_b32 v7, v8, v7, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_alignbit_b32 v8, v9, v8, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_alignbit_b32 v9, v10, v9, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_alignbit_b32 v10, v11, v10, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_alignbit_b32 v11, v12, v11, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_alignbit_b32 v12, v13, v12, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_alignbit_b32 v13, v14, v13, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_alignbit_b32 v14, v15, v14, 16
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v34
@@ -49611,69 +49682,73 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v15, v16, v15, 16
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v33
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_alignbit_b32 v16, v17, v16, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_alignbit_b32 v17, v18, v17, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v42
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v40
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v44
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v48
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v38
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v38
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v48
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v36
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v50
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v35
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v51
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff0000, v37
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v53
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v52
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v39
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v55
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v53
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v49
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v40
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v63
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v54
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v54
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v48
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v55
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v50
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v41
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v42
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v43
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v44
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v20
@@ -51366,43 +51441,42 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v37, v18, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v20, 0xffff, v20, v32
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v37, 0x40c00000, v38 :: v_dual_cndmask_b32 v34, v34, v36
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v36, 0x400000, v18
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v38, 16, v16
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v48, 0x400000, v37
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v19, 0xffff, v19, v33
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v18, v35, v36, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v38
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v38, v17, 16, 1
+; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v35, v37, 16, 1
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v39, v36, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v38, v17, 0x7fff
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v51, 0x400000, v36
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v35, v37, 0x7fff
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v39, v39, v36, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v17, v38, v49, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v16
-; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v18.l, v18.h
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v17.l, v17.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v36, v39, v51, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v18, 0xffff, v18, v34
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v36.l, v36.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v35, v35, v48, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v17, 0xffff, v17, v35
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v16, v38, v49, vcc_lo
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v16, 0xffff, v36, v16
 ; GFX11-TRUE16-NEXT:  .LBB31_2: ; %end
 ; GFX11-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s0
@@ -51563,15 +51637,15 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v34, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v37, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v33, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v34, 0x40c00000, v34
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v34, 0x40c00000, v34 :: v_dual_add_f32 v35, 0x40c00000, v37
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v35, 0x40c00000, v37
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v38, v34, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v32, v35, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v36, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
@@ -51920,17 +51994,16 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v39, 16, v16
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v18, v36, v37, vcc_lo
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v39
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v38, v35, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v38, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v39, v17, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v48, v36, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v49, 0x400000, v36
-; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v18, v18, v34, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v35, v37, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v38, v39, v17, 0x7fff
@@ -51938,7 +52011,7 @@ define <16 x i64> @bitcast_v64bf16_to_v16i64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v48, v48, v36, 0x7fff
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v38, v39, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v37, v16, 0x7fff
@@ -51990,12 +52063,12 @@ define <64 x half> @bitcast_v16i64_to_v64f16(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -52099,7 +52172,6 @@ define <64 x half> @bitcast_v16i64_to_v64f16(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB32_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v62
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v63
@@ -52937,28 +53009,26 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v0
@@ -52971,67 +53041,61 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v9
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v10
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v49
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -53041,28 +53105,27 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v48
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v4
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v43
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v41
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v41
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v55
@@ -53070,46 +53133,58 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v53
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v54
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v54
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v50
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v52
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v48
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v39
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v49
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v35
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v37
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v33
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v31
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v0
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v3
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v7
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB33_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v61
 ; GCN-NEXT:    v_or_b32_e32 v0, v62, v0
@@ -53121,123 +53196,123 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v47
 ; GCN-NEXT:    v_or_b32_e32 v4, v46, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v5, v44, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v44
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v38
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v32
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v32
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v38
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v43, v19
-; GCN-NEXT:    v_or_b32_e32 v20, v41, v20
-; GCN-NEXT:    v_or_b32_e32 v21, v55, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v49, v22
-; GCN-NEXT:    v_or_b32_e32 v23, v50, v23
-; GCN-NEXT:    v_or_b32_e32 v24, v39, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v41, v19
+; GCN-NEXT:    v_or_b32_e32 v20, v55, v20
+; GCN-NEXT:    v_or_b32_e32 v21, v53, v21
+; GCN-NEXT:    v_or_b32_e32 v22, v50, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v39, v23
+; GCN-NEXT:    v_or_b32_e32 v24, v35, v24
 ; GCN-NEXT:    v_or_b32_e32 v25, v36, v25
-; GCN-NEXT:    v_or_b32_e32 v26, v48, v26
+; GCN-NEXT:    v_or_b32_e32 v26, v49, v26
 ; GCN-NEXT:    v_or_b32_e32 v27, v52, v27
-; GCN-NEXT:    v_or_b32_e32 v28, v53, v28
-; GCN-NEXT:    v_or_b32_e32 v29, v54, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v40, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v42, v31
+; GCN-NEXT:    v_or_b32_e32 v28, v54, v28
+; GCN-NEXT:    v_or_b32_e32 v29, v40, v29
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -53249,10 +53324,8 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -53290,6 +53363,7 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -53302,42 +53376,44 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:  .LBB33_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB33_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v63
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v62
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v61
@@ -53379,19 +53455,15 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v44
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
@@ -53400,10 +53472,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
@@ -53412,10 +53484,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
@@ -53424,10 +53496,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
@@ -53436,10 +53508,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
@@ -53448,10 +53520,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
@@ -53460,10 +53532,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
@@ -53472,10 +53544,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
@@ -53484,10 +53556,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
@@ -53497,7 +53569,7 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
@@ -53506,10 +53578,8 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
@@ -53518,10 +53588,10 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
@@ -53530,51 +53600,56 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v41
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v41
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v55
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v55
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v53
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v49
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v50
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v50
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v39
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v39
-; GCN-NEXT:    v_mov_b32_e32 v39, v32
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v35
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v36
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v39
-; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v48
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v49
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v52
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v53
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v54
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v35
-; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v40
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v40
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
@@ -53590,8 +53665,8 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
 ; GCN-NEXT:    v_add_f32_e32 v32, 0x38000000, v32
+; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
 ; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
-; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
 ; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
@@ -53599,9 +53674,9 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
 ; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
-; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
-; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
 ; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
+; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
+; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
@@ -53618,18 +53693,18 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v30
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v31
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v36
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v48
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v49
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v50
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v51
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
@@ -53639,12 +53714,12 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_or_b32_e32 v19, v21, v20
 ; GCN-NEXT:    v_or_b32_e32 v20, v55, v39
@@ -53653,12 +53728,12 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v50
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v51
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v36
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v38
+; GCN-NEXT:    v_or_b32_e32 v26, v26, v35
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v36
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v33
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v34
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v35
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v37
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v37
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v38
 ; GCN-NEXT:  .LBB33_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Reload
@@ -53677,7 +53752,7 @@ define <16 x i64> @bitcast_v64f16_to_v16i64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: bitcast_v64f16_to_v16i64:
@@ -53934,10 +54009,10 @@ define <64 x i16> @bitcast_v16i64_to_v64i16(<16 x i64> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -54473,11 +54548,11 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mov_b32_e32 v37, v20
 ; GCN-NEXT:    v_mov_b32_e32 v38, v18
 ; GCN-NEXT:    v_mov_b32_e32 v39, v16
@@ -54489,127 +54564,128 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v53, v4
 ; GCN-NEXT:    v_mov_b32_e32 v54, v2
 ; GCN-NEXT:    v_mov_b32_e32 v55, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
+; GCN-NEXT:    s_waitcnt expcnt(6)
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(5)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(12) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v6
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v11
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v7
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -54618,132 +54694,132 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v55
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v54
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v36
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v59
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v58
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v53
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v57
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v52
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v35
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v56
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v51
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v60
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v50
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v49
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v48
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v39
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v38
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v43
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v46
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v45
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v46
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v45
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v32
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v34
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v42
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v41
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v40
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v63
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v62
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v47
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v33
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v44
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v44
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v43
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v42
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v41
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v63
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v62
+; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v61
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v60
+; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v32
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v33
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v35
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v34
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v47
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v19, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v21, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v22, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v27, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v32
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v59
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr53
@@ -54765,81 +54841,81 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:  .LBB35_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB35_4
@@ -54847,7 +54923,7 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v36, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v59, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v54
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v58, v1
@@ -54856,7 +54932,7 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v2, v57, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v52
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v35, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v56, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x30000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v51
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v50
@@ -54865,39 +54941,37 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v39
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v38
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v15
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v43
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v56
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v46
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v45
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v32
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v34
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v42
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v41
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v40
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v47
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v33
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v46
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v45
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v42
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v41
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v40
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v60
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v32
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v33
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v35
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v34
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -54926,86 +55000,88 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v4, v60, v4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v32, v8
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v32, v9
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v32, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v32, v11
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v32, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v32, v13
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v32, v15
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v32, v25
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v32, v27
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v32, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v59, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v36, v31
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x30000, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -55047,9 +55123,7 @@ define <16 x i64> @bitcast_v64i16_to_v16i64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:152 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:156 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:160 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(11)
 ; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:164 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(10)
 ; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:168 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:172 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:176 ; 4-byte Folded Reload
@@ -55314,10 +55388,10 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
@@ -56557,8 +56631,8 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    ; kill: killed $vgpr39
@@ -57198,13 +57272,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 40, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v12, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 44, v0
@@ -57224,13 +57299,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v14, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 52, v0
@@ -57250,13 +57326,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 56, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v16, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 60, v0
@@ -57276,13 +57353,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 64, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v18, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x44, v0
@@ -57302,13 +57380,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x4c, v0
@@ -57328,13 +57407,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x50, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x54, v0
@@ -57354,13 +57434,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x58, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x5c, v0
@@ -57380,13 +57461,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x60, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x64, v0
@@ -57431,13 +57513,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x70, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x74, v0
@@ -57505,8 +57588,8 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; kill: killed $vgpr41
@@ -57700,14 +57783,13 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(32)
+; GFX9-NEXT:    s_waitcnt vmcnt(31)
 ; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB36_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
-; GFX9-NEXT:    s_waitcnt vmcnt(31)
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 24, v32
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
@@ -58309,13 +58391,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:72
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:76
@@ -58335,13 +58418,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:80
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:84
@@ -58361,13 +58445,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:88
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:92
@@ -58387,13 +58472,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:96
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:100
@@ -58413,13 +58499,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:104
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:108
@@ -58439,13 +58526,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:112
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:116
@@ -58465,13 +58553,14 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:120
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v32, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:124
@@ -59750,154 +59839,147 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v55, v0
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:92
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:388
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:60
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:20
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:16
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v48, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:388
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:116
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v12
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 24, v10
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v5
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v20
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v20
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v18
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v7
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
@@ -59905,295 +59987,304 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:140
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:148
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:168
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:164
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:164
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:172
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:200
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:196
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:204
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:216
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:244
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:276
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:308
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:340
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 8, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:368
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:380
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:384
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:368
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:376
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 24, v0
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v1
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:372
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:76
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:12
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:372
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:364
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 24, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v4
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB37_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v39
+; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v55
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v38
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v56
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v48
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v3, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v38
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v56
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v47
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v54
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v54
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v47
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v46
@@ -60201,101 +60292,99 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v37
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v53
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v43
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v36
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v45
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
+; GCN-NEXT:    v_or_b32_e32 v8, v8, v37
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v53
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v45
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v42
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v40
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v52
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v63
@@ -60303,217 +60392,221 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v50
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v58
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v51
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v39
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v49
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v60
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v51
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v58
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v50
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v62
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v61
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v59
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v61
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v60
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v34
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v52
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v62
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v35
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v33
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v44
-; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v57
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v43
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v50, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v51, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v52, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v33, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v36, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v36, v37, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v37
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v37, v38, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v38, v39, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xff, v39
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v48, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v49, v39
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v48, v55, v48
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v48, v49, v48
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xff, v49
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v49, v54, v49
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
@@ -60544,6 +60637,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v39
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v48
 ; GCN-NEXT:    v_or_b32_e32 v31, v31, v49
+; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -60576,13 +60670,14 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -60667,37 +60762,34 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr42
@@ -60706,13 +60798,13 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -60727,19 +60819,20 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr59
@@ -60748,297 +60841,298 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; kill: killed $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; kill: killed $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; kill: killed $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:  .LBB37_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB37_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v0
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v39, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v38, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v56, v1
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v48, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v38, v2
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v56, v2
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v47, v3
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v54, v3
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v54, v4
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v47, v4
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v46, v5
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v37, v6
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v6, v53, v6
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v43, v7
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v48
+; GCN-NEXT:    v_or_b32_e32 v7, v36, v7
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v45, v8
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v37, v8
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v53, v9
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v9, v45, v9
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v42, v10
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v41, v11
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v40, v12
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v52, v12
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v63, v13
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v50, v14
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v58, v14
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v0, v15
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v0, v16
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v0, v17
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v51, v18
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v18, v39, v18
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v49, v19
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v19, v0, v19
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v60, v20
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v51, v20
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v58, v21
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v50, v21
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v25, v62, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v61, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v29, v59, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v37, v32, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v36, v32, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v50, v61, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v60, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v41, v34, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v45, v52, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v45, v62, v22
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v56, v35, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v58, v33, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v59, v44, v22
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v57
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_or_b32_e32 v57, v36, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v60, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v43, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v61, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v62, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v63, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v63, v0, v22
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v38, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v37, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v38, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v49, v0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_mov_b32_e32 v0, v49
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v51, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v49, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v51, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v52, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v52, v24, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v57
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v54, v23, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v54, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v22
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v57, v24, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
 ; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
@@ -61046,15 +61140,15 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v24
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v26, v24
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v26
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v27, v26
 ; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
@@ -61062,151 +61156,150 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v28, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v30, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v31, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v35, v34
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v35
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v48, v35
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v48, vcc, 3, v48
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v48, v53, v48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, 3, v53
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_mov_b32_e32 v0, v55
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v55, v53
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v55
 ; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v55, v40, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v40
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v40, v42, v40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 3, v42
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xff, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v42, v43, v42
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v43
 ; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v43, v44, v43
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v44
 ; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v44, v46, v44
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v0, v44
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v46, vcc, 3, v46
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xff, v46
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v46, v0, v46
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v47, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v47, v0, v47
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v60, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v61, v0
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v61, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v62, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v62, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v63, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v63, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v22, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
@@ -61226,16 +61319,16 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, s7, v37
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v50
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v41
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v45
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v56
-; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v58
-; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v59
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, 0x300, v57
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, s7, v25
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v36
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, s7, v50
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, s7, v41
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, s7, v45
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, s7, v56
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, s7, v58
+; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v59
+; GCN-NEXT:    v_add_i32_e32 v59, vcc, 0x300, v60
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -61254,24 +61347,24 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v36
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff, v50
 ; GCN-NEXT:    v_and_b32_e32 v41, 0xffff, v41
 ; GCN-NEXT:    v_and_b32_e32 v45, 0xffff, v45
 ; GCN-NEXT:    v_and_b32_e32 v56, 0xffff, v56
 ; GCN-NEXT:    v_and_b32_e32 v58, 0xffff, v58
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xffff, v59
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff, v57
-; GCN-NEXT:    v_or_b32_e32 v4, v36, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v37, v4
 ; GCN-NEXT:    v_or_b32_e32 v5, v38, v5
 ; GCN-NEXT:    v_or_b32_e32 v6, v39, v6
 ; GCN-NEXT:    v_or_b32_e32 v7, v49, v7
 ; GCN-NEXT:    v_or_b32_e32 v8, v51, v8
 ; GCN-NEXT:    v_or_b32_e32 v9, v52, v9
 ; GCN-NEXT:    v_or_b32_e32 v10, v54, v10
-; GCN-NEXT:    v_or_b32_e32 v11, v22, v11
+; GCN-NEXT:    v_or_b32_e32 v11, v57, v11
 ; GCN-NEXT:    v_or_b32_e32 v12, v23, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v24, v13
 ; GCN-NEXT:    v_or_b32_e32 v14, v26, v14
@@ -61282,16 +61375,16 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
 ; GCN-NEXT:    v_or_b32_e32 v20, v33, v20
 ; GCN-NEXT:    v_or_b32_e32 v21, v34, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v35, v25
-; GCN-NEXT:    v_or_b32_e32 v23, v48, v29
-; GCN-NEXT:    v_or_b32_e32 v24, v53, v37
-; GCN-NEXT:    v_or_b32_e32 v25, v55, v50
-; GCN-NEXT:    v_or_b32_e32 v26, v40, v41
-; GCN-NEXT:    v_or_b32_e32 v27, v42, v45
-; GCN-NEXT:    v_or_b32_e32 v28, v43, v56
-; GCN-NEXT:    v_or_b32_e32 v29, v44, v58
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v59
-; GCN-NEXT:    v_or_b32_e32 v31, v47, v57
+; GCN-NEXT:    v_or_b32_e32 v22, v35, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v48, v25
+; GCN-NEXT:    v_or_b32_e32 v24, v53, v29
+; GCN-NEXT:    v_or_b32_e32 v25, v55, v36
+; GCN-NEXT:    v_or_b32_e32 v26, v40, v50
+; GCN-NEXT:    v_or_b32_e32 v27, v42, v41
+; GCN-NEXT:    v_or_b32_e32 v28, v43, v45
+; GCN-NEXT:    v_or_b32_e32 v29, v44, v56
+; GCN-NEXT:    v_or_b32_e32 v30, v46, v58
+; GCN-NEXT:    v_or_b32_e32 v31, v47, v59
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -61397,19 +61490,18 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -61441,59 +61533,61 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -61502,25 +61596,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -61528,25 +61622,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -61554,25 +61648,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -61580,10 +61674,10 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -61595,10 +61689,10 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -61609,22 +61703,22 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -61635,14 +61729,14 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -61654,34 +61748,33 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB37_2
 ; VI-NEXT:  ; %bb.1: ; %cmp.false
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -61689,19 +61782,19 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -61712,23 +61805,23 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr55
+; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr38
@@ -61769,147 +61862,217 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; kill: killed $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; kill: killed $vgpr33
 ; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr54
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -61919,16 +62082,16 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -61937,7 +62100,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -61946,253 +62109,190 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr33
-; VI-NEXT:    ; kill: killed $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr32
+; VI-NEXT:    ; kill: killed $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:  .LBB37_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB37_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v31, 0x300
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v61
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_add_u16_e32 v8, 3, v8
+; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v1, 3, v1
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v1
+; VI-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; VI-NEXT:    v_add_u16_sdwa v1, v1, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -62201,31 +62301,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v1, v2, v3
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(13)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(11)
+; VI-NEXT:    s_waitcnt vmcnt(10)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v9, v9, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -62262,13 +62356,13 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v63
+; VI-NEXT:    v_add_u16_e32 v8, 3, v62
 ; VI-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v59
+; VI-NEXT:    v_add_u16_e32 v9, 3, v32
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v62
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
@@ -62276,27 +62370,28 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
 ; VI-NEXT:    v_add_u16_e32 v10, 3, v58
 ; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v60
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v57
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v46
+; VI-NEXT:    v_add_u16_e32 v12, 3, v47
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -62307,148 +62402,149 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v54
+; VI-NEXT:    v_add_u16_e32 v14, 3, v42
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v40
-; VI-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v15
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v15, 3, v15
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v15, v15, v16
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v17, v17, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v16, v16, v17
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v18, v18, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v17, v17, v18
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v18, 3, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v19, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v18, v18, v19
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v20, v20, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v19, v19, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v21, v21, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v20, v20, v21
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v22, v22, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v21, v21, v22
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v23, v23, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v23
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v23, 3, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v24, v24, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v23, v23, v24
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v24, 3, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
 ; VI-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v25, v25, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v24, v24, v25
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -62462,19 +62558,19 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v26, v26, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v25, v25, v26
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v26, 3, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
 ; VI-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v27, v27, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v27
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -62488,20 +62584,20 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v28, v28, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v28
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v28, 3, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v29, v29, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v29
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v29, 3, v29
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -62514,7 +62610,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v30
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v30, 3, v30
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -62527,7 +62623,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v30, v30, v32
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v32, 3, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -62536,7 +62632,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v32, 0x300, v32
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v33, 3, v33
-; VI-NEXT:    v_or_b32_sdwa v33, v34, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v33, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v31, v33, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v31, v32, v31
 ; VI-NEXT:  .LBB37_4: ; %end
@@ -62613,19 +62709,18 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:136
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:144
-; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:152
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:160
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:168
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:176
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:144
+; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:152
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:160
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:168
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:176
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:184
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v13
@@ -62637,93 +62732,95 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v29
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(21)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v54
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v32
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v40
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v42
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:172
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:180
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v45
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v46
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v47
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v56
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
@@ -62733,25 +62830,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:204
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:212
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:224
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:232
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:240
@@ -62760,25 +62857,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:220
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -62787,25 +62884,25 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:252
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -62814,10 +62911,10 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
@@ -62829,10 +62926,10 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:308
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:320
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:328
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:336
@@ -62844,22 +62941,22 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:340
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:352
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:360
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:368
@@ -62871,14 +62968,14 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
@@ -62890,34 +62987,33 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
 ; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB37_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -62925,20 +63021,20 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v55 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v41 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -62949,23 +63045,23 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v57, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr38
@@ -63006,147 +63102,217 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; kill: killed $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; kill: killed $vgpr33
 ; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v32, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v54, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -63156,16 +63322,16 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
@@ -63174,7 +63340,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
@@ -63183,258 +63349,196 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v63 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; kill: killed $vgpr32
 ; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr33
-; GFX9-NEXT:    ; kill: killed $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; kill: killed $vgpr63
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:  .LBB37_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB37_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(13)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(33)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
+; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v1, v42, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v41, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v3, v55, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(23)
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v1
+; GFX9-NEXT:    v_or_b32_sdwa v1, v41, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_add_u16_sdwa v1, v1, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v2
@@ -63443,32 +63547,26 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_e32 v1, v2, v3
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v50, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v4, v4, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v5, 3, v5
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v48, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v38, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(12)
+; GFX9-NEXT:    s_waitcnt vmcnt(11)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v36, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v34, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v53, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v52, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -63505,41 +63603,41 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v35, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v33, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v32
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v57
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
@@ -63550,148 +63648,149 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v54
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v40
-; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v17, v17, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v16, v16, v17
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v18, v18, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v17, v17, v18
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v19, v19, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v18, v18, v19
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v20, v20, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v19, v19, v20
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v21, v21, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v20, v20, v21
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v22, v22, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v21, v21, v22
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v23, v23, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v22, v22, v23
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v23, 3, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v24, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v23
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v24, v24, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v23, v23, v24
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v26, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v25, v25, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v24, v24, v25
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v25, 3, v25
@@ -63705,19 +63804,19 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v26, v26, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v25, v25, v26
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v26, 3, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v27, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v27, v27, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v26, v26, v27
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -63731,20 +63830,20 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v28, v28, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v27, v27, v28
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v28, 3, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v28, v29, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v30, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v29, v29, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v28, v28, v29
-; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v29, 3, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -63757,7 +63856,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v30, v30, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v29, v29, v30
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v30, 3, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -63770,7 +63869,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v31, v31, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v30, v30, v31
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -63779,7 +63878,7 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v32, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v32, v33, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v32, v63, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v32, v32, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_e32 v31, v31, v32
 ; GFX9-NEXT:  .LBB37_4: ; %end
@@ -65819,10 +65918,10 @@ define <64 x bfloat> @bitcast_v16f64_to_v64bf16(<16 x double> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr34
@@ -65925,7 +66024,6 @@ define <64 x bfloat> @bitcast_v16f64_to_v64bf16(<16 x double> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB38_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff0000, v32
 ; GCN-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
@@ -66609,29 +66707,27 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:72
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32
 ; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v1
 ; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v0
 ; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v3
@@ -66642,68 +66738,62 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v46, 1.0, v9
 ; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v8
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v11
 ; GCN-NEXT:    v_mul_f32_e32 v45, 1.0, v10
-; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -66713,69 +66803,79 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
 ; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v33
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v5
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v42
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v44
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v55
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v40
+; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v51
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v53
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v52
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v48
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v50
-; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v50
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v38
+; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v44
+; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v63
 ; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
-; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v63
+; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v42
 ; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
-; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v43
+; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v55
 ; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
-; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v41
+; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v54
 ; GCN-NEXT:    v_mul_f32_e32 v49, 1.0, v49
-; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v54
+; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v31
-; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_mul_f32_e32 v54, 1.0, v0
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v1
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v2
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v6
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v0
+; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -66792,91 +66892,87 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v3, v3, v57, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v46
 ; GCN-NEXT:    v_alignbit_b32 v4, v4, v47, 16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v43
 ; GCN-NEXT:    v_alignbit_b32 v5, v5, v45, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v6, v6, v7, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v6, v6, v41, 16
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v7, v7, v8, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v8, v8, v9, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v9, v9, v10, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v10, v10, v11, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v11, v11, v12, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v12, v12, v13, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v13, v13, v14, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v14, v14, v15, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v34
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v15, v15, v16, 16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v16, v16, v33, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v17, v17, v18, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
 ; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v36
@@ -66884,30 +66980,34 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v39
 ; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v49
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v52
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v54
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v41
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v43
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v54
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v55
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v42
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v44
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v18, v18, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v19, v19, v42, 16
-; GCN-NEXT:    v_alignbit_b32 v20, v20, v44, 16
+; GCN-NEXT:    v_alignbit_b32 v19, v19, v40, 16
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v20, v20, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v21, v21, v32, 16
-; GCN-NEXT:    v_alignbit_b32 v22, v22, v48, 16
-; GCN-NEXT:    v_alignbit_b32 v23, v23, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v22, v22, v38, 16
+; GCN-NEXT:    v_alignbit_b32 v23, v23, v48, 16
 ; GCN-NEXT:    v_alignbit_b32 v24, v24, v50, 16
 ; GCN-NEXT:    v_alignbit_b32 v25, v25, v51, 16
-; GCN-NEXT:    v_alignbit_b32 v26, v26, v53, 16
-; GCN-NEXT:    v_alignbit_b32 v27, v27, v55, 16
-; GCN-NEXT:    v_alignbit_b32 v28, v28, v40, 16
-; GCN-NEXT:    v_alignbit_b32 v29, v29, v63, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v26, v26, v52, 16
+; GCN-NEXT:    v_alignbit_b32 v27, v27, v53, 16
+; GCN-NEXT:    v_alignbit_b32 v28, v28, v63, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v29, v29, v32, 16
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v30, v30, v32, 16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v31, v31, v32, 16
 ; GCN-NEXT:    ; implicit-def: $vgpr62
@@ -66920,13 +67020,11 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; kill: killed $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -66975,10 +67073,11 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr35
@@ -66986,26 +67085,27 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; kill: killed $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; kill: killed $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; kill: killed $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:  .LBB39_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB39_4
@@ -67042,104 +67142,100 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_alignbit_b32 v4, v5, v4, 16
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v43
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_alignbit_b32 v5, v6, v5, 16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v41
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_alignbit_b32 v6, v7, v6, 16
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_alignbit_b32 v7, v8, v7, 16
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_alignbit_b32 v8, v9, v8, 16
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_alignbit_b32 v9, v10, v9, 16
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_alignbit_b32 v10, v11, v10, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_alignbit_b32 v11, v12, v11, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_alignbit_b32 v12, v13, v12, 16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_alignbit_b32 v13, v14, v13, 16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_alignbit_b32 v14, v15, v14, 16
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v34
@@ -67148,69 +67244,73 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_alignbit_b32 v15, v16, v15, 16
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v33
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_alignbit_b32 v16, v17, v16, 16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_alignbit_b32 v17, v18, v17, 16
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v42
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v40
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v44
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v48
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v38
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v38
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v48
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v36
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v50
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v35
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v51
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff0000, v37
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v53
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v52
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v39
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v55
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v53
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v49
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v40
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v63
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v54
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v54
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v48
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v55
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v50
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v41
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v42
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v43
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v44
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v20
@@ -68903,43 +69003,42 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v37, v18, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v20, 0xffff, v20, v32
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v37, 0x40c00000, v38 :: v_dual_cndmask_b32 v34, v34, v36
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v36, 0x400000, v18
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v38, 16, v16
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v48, 0x400000, v37
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v19, 0xffff, v19, v33
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v18, v35, v36, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v38
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v38, v17, 16, 1
+; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v17
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v35, v37, 16, 1
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v39, v36, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v38, v17, 0x7fff
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v51, 0x400000, v36
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v35, v37, 0x7fff
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v39, v39, v36, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v17, v38, v49, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v16, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v49, 0x400000, v16
-; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v18.l, v18.h
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v17.l, v17.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v36, v39, v51, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v16, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v18, 0xffff, v18, v34
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v36.l, v36.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v35, v35, v48, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v17, 0xffff, v17, v35
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v16, v38, v49, vcc_lo
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v16, 0xffff, v36, v16
 ; GFX11-TRUE16-NEXT:  .LBB39_2: ; %end
 ; GFX11-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s0
@@ -69100,15 +69199,15 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v34, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v37, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v33, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v34, 0x40c00000, v34
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v34, 0x40c00000, v34 :: v_dual_add_f32 v35, 0x40c00000, v37
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v35, 0x40c00000, v37
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v38, v34, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v32, v35, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v36, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
@@ -69457,17 +69556,16 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v39, 16, v16
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v18, v36, v37, vcc_lo
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v39
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v38, v35, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v38, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v39, v17, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v48, v36, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v49, 0x400000, v36
-; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v18, v18, v34, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v35, v37, v38, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v38, v39, v17, 0x7fff
@@ -69475,7 +69573,7 @@ define <16 x double> @bitcast_v64bf16_to_v16f64(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v48, v48, v36, 0x7fff
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_or_b32_e32 v50, 0x400000, v16
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v38, v39, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v37, v37, v16, 0x7fff
@@ -69527,10 +69625,10 @@ define <64 x half> @bitcast_v16f64_to_v64f16(<16 x double> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr34
@@ -69633,7 +69731,6 @@ define <64 x half> @bitcast_v16f64_to_v64f16(<16 x double> %a, i32 %b) {
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB40_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v34, 16, v31
@@ -70383,28 +70480,26 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v0
@@ -70417,67 +70512,61 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v9
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v10
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v15
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v17
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v16
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v18
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v21
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v20
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v23
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v25
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v27
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v29
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v28
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v49
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:20
@@ -70487,28 +70576,27 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:116
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
+; GCN-NEXT:    s_waitcnt vmcnt(14)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v48
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v4
+; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v43
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v41
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v41
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v55
@@ -70516,46 +70604,58 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v53
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v54
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v54
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v50
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v52
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v48
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v39
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v49
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v35
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v37
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v33
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v31
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v0
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v3
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v7
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB41_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v61
 ; GCN-NEXT:    v_or_b32_e32 v0, v62, v0
@@ -70567,123 +70667,123 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v47
 ; GCN-NEXT:    v_or_b32_e32 v4, v46, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v5, v44, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v44
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v38
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v32
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v32
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v38
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v43, v19
-; GCN-NEXT:    v_or_b32_e32 v20, v41, v20
-; GCN-NEXT:    v_or_b32_e32 v21, v55, v21
-; GCN-NEXT:    v_or_b32_e32 v22, v49, v22
-; GCN-NEXT:    v_or_b32_e32 v23, v50, v23
-; GCN-NEXT:    v_or_b32_e32 v24, v39, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v41, v19
+; GCN-NEXT:    v_or_b32_e32 v20, v55, v20
+; GCN-NEXT:    v_or_b32_e32 v21, v53, v21
+; GCN-NEXT:    v_or_b32_e32 v22, v50, v22
+; GCN-NEXT:    v_or_b32_e32 v23, v39, v23
+; GCN-NEXT:    v_or_b32_e32 v24, v35, v24
 ; GCN-NEXT:    v_or_b32_e32 v25, v36, v25
-; GCN-NEXT:    v_or_b32_e32 v26, v48, v26
+; GCN-NEXT:    v_or_b32_e32 v26, v49, v26
 ; GCN-NEXT:    v_or_b32_e32 v27, v52, v27
-; GCN-NEXT:    v_or_b32_e32 v28, v53, v28
-; GCN-NEXT:    v_or_b32_e32 v29, v54, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v40, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v42, v31
+; GCN-NEXT:    v_or_b32_e32 v28, v54, v28
+; GCN-NEXT:    v_or_b32_e32 v29, v40, v29
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
@@ -70695,10 +70795,8 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -70736,6 +70834,7 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -70748,42 +70847,44 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; kill: killed $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:  .LBB41_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB41_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v63
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v62
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v61
@@ -70825,19 +70926,15 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
 ; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v45
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v44
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v43
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
@@ -70846,10 +70943,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v7, v6
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
@@ -70858,10 +70955,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
@@ -70870,10 +70967,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
@@ -70882,10 +70979,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
@@ -70894,10 +70991,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
@@ -70906,10 +71003,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
@@ -70918,10 +71015,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v13, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
@@ -70930,10 +71027,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_or_b32_e32 v13, v14, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
@@ -70943,7 +71040,7 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v51
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
@@ -70952,10 +71049,8 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_or_b32_e32 v15, v16, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
@@ -70964,10 +71059,10 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
@@ -70976,51 +71071,56 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_or_b32_e32 v17, v18, v17
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v41
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v41
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v55
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v55
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v53
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v49
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v50
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v50
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v39
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v39
-; GCN-NEXT:    v_mov_b32_e32 v39, v32
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v35
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v36
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v39
-; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v48
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v49
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v52
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v53
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v54
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v35
-; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v40
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v40
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
@@ -71036,8 +71136,8 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
 ; GCN-NEXT:    v_add_f32_e32 v32, 0x38000000, v32
+; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
 ; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
-; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
 ; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
@@ -71045,9 +71145,9 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
 ; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
-; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
-; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
 ; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
+; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
+; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
@@ -71064,18 +71164,18 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v30
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v31
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v36
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v48
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v49
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v50
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v51
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
@@ -71085,12 +71185,12 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_or_b32_e32 v18, v19, v18
 ; GCN-NEXT:    v_or_b32_e32 v19, v21, v20
 ; GCN-NEXT:    v_or_b32_e32 v20, v55, v39
@@ -71099,12 +71199,12 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v50
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v51
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v36
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v38
+; GCN-NEXT:    v_or_b32_e32 v26, v26, v35
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v36
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v33
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v34
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v35
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v37
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v37
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v38
 ; GCN-NEXT:  .LBB41_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Reload
@@ -71123,7 +71223,7 @@ define <16 x double> @bitcast_v64f16_to_v16f64(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: bitcast_v64f16_to_v16f64:
@@ -71380,10 +71480,10 @@ define <64 x i16> @bitcast_v16f64_to_v64i16(<16 x double> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -71985,11 +72085,11 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:136 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mov_b32_e32 v37, v20
 ; GCN-NEXT:    v_mov_b32_e32 v38, v18
 ; GCN-NEXT:    v_mov_b32_e32 v39, v16
@@ -72001,127 +72101,128 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v53, v4
 ; GCN-NEXT:    v_mov_b32_e32 v54, v2
 ; GCN-NEXT:    v_mov_b32_e32 v55, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
+; GCN-NEXT:    s_waitcnt expcnt(6)
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(5)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(12) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v6
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v26
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v11
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v7
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -72130,132 +72231,132 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v55
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v54
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v36
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v59
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v58
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v53
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v57
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v52
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v35
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v56
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v51
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v60
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v50
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v49
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v48
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v39
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v38
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v43
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v46
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v56
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v45
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v46
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v45
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v32
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v34
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v42
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v41
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v40
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v63
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v62
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v47
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v33
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v44
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v44
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v43
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v42
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v41
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v63
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v62
+; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v61
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v60
+; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v32
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v33
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v35
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v34
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v47
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v19, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v21, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v22, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v23, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v27, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v28, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v32
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v59
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr53
@@ -72277,81 +72378,81 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; kill: killed $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; kill: killed $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; kill: killed $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:  .LBB43_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB43_4
@@ -72359,7 +72460,7 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v55
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v36, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v59, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v54
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v58, v1
@@ -72368,7 +72469,7 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v2, v57, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v52
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v35, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v56, v3
 ; GCN-NEXT:    s_mov_b32 s6, 0x30000
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v51
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v50
@@ -72377,39 +72478,37 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v39
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v38
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v37
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v11
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v15
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v43
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v56
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v46
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v45
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v32
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v34
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v42
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v41
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v40
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v61
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v47
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v33
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v46
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v45
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v42
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v41
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v40
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v60
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v32
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v33
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v35
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v34
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v47
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -72438,86 +72537,88 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v4, v60, v4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v32, v8
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v32, v9
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v32, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v32, v11
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v32, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v32, v13
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v32, v15
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v32, v25
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v32, v27
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v32, v30
-; GCN-NEXT:    v_or_b32_e32 v31, v59, v31
+; GCN-NEXT:    v_or_b32_e32 v31, v36, v31
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x30000, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -72559,9 +72660,7 @@ define <16 x double> @bitcast_v64i16_to_v16f64(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:152 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:156 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:160 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(11)
 ; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:164 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(10)
 ; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:168 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:172 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:176 ; 4-byte Folded Reload
@@ -72826,39 +72925,36 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:76
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:72
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:64
@@ -72877,11 +72973,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:44
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:40
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:32
@@ -72891,7 +72987,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:20
 ; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:16
@@ -72900,16 +72996,16 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:12
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v6
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v8
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
@@ -72918,14 +73014,16 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v14
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v16
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v20
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v15, 8, v22
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v22
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v24
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
@@ -72942,74 +73040,72 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:392
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:112
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:104
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v11
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v10
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:848 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v9
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v7
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v5
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v4
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:96
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v2
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:156
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:120
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:148
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73022,28 +73118,29 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:136
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:188
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:164
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:152
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:164
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:160
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:172
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73052,24 +73149,23 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:168
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v33, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:184
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:196
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:192
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:220
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:208
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73077,121 +73173,124 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:204
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:216
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:216
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:240
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:236
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:240
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:232
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:248
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:256
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:248
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:284
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:272
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:272
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:268
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:268
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:264
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:264
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:280
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:288
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:292
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:280
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:308
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:304
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:304
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:300
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:296
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:296
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v13, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:324
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:312
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:320
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:340
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:336
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:336
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:332
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:328
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:356
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:344
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:332
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:328
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v13, 8, v2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:356
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:388
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384
@@ -73199,23 +73298,23 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:376
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:372
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:376
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:372
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:364
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:360
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v23, 8, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v6
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
@@ -73258,14 +73357,14 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr29
 ; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr7
-; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr6
 ; GCN-NEXT:    ; implicit-def: $vgpr2
 ; GCN-NEXT:    ; implicit-def: $vgpr46
@@ -73275,104 +73374,106 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr18
 ; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr25
 ; GCN-NEXT:    ; implicit-def: $vgpr8
 ; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr14
 ; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr20
+; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr22
 ; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr16
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr38
-; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB44_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v14, v11
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v47, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v11, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v3, v21
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v53, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v55, v1, v15
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v52, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v40, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v55, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v41, v1, v2
+; GCN-NEXT:    v_or_b32_e32 v40, v1, v2
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v43, v1, v2
+; GCN-NEXT:    v_or_b32_e32 v41, v1, v2
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v44, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v43, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v45, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v46, v1, v2
+; GCN-NEXT:    v_or_b32_e32 v45, v1, v2
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v1, v2
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v35
-; GCN-NEXT:    v_or_b32_e32 v22, v1, v22
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v16
-; GCN-NEXT:    v_or_b32_e32 v16, v1, v52
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v56
-; GCN-NEXT:    v_or_b32_e32 v5, v1, v5
+; GCN-NEXT:    v_or_b32_e32 v33, v1, v33
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v1, v20
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v38
+; GCN-NEXT:    v_or_b32_e32 v20, v1, v57
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v53
+; GCN-NEXT:    v_or_b32_e32 v27, v1, v27
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v31
-; GCN-NEXT:    v_or_b32_e32 v13, v1, v13
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v21, v1, v21
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v29
 ; GCN-NEXT:    v_or_b32_e32 v23, v1, v23
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v1, v13
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v15
+; GCN-NEXT:    v_or_b32_e32 v13, v1, v19
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
@@ -73381,7 +73482,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73390,7 +73491,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73399,7 +73500,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73409,7 +73510,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73418,7 +73519,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73427,7 +73528,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73437,7 +73538,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73446,7 +73547,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73455,7 +73556,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73465,7 +73566,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -73474,60 +73575,60 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
@@ -73536,41 +73637,39 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v3, v2, v1
+; GCN-NEXT:    v_or_b32_e32 v29, v2, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v2, v1
+; GCN-NEXT:    v_or_b32_e32 v3, v2, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v32, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v7, v2, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v33, v2, v1
+; GCN-NEXT:    v_or_b32_e32 v32, v2, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
@@ -73584,7 +73683,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v2, v1
+; GCN-NEXT:    v_or_b32_e32 v35, v2, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
@@ -73600,11 +73699,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v8, v1
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
@@ -73615,7 +73714,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 24, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v10, v9, v8
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
@@ -73626,13 +73725,13 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v12, v8
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 24, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
@@ -73641,241 +73740,249 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v12, v8
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v26, v12, v8
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v26, v12, v8
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
+; GCN-NEXT:    v_or_b32_e32 v36, v12, v8
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v14
 ; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:844 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 24, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v25, v12, v8
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v12, v8
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v37, v14, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v14, v12
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 24, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    v_or_b32_e32 v12, v14, v12
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 24, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
+; GCN-NEXT:    v_or_b32_e32 v37, v15, v14
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v14, v20, v14
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v24, v20
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v16, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 24, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_or_b32_e32 v24, v27, v24
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 24, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    v_or_b32_e32 v22, v16, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v28, v27
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v24, v16, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v29, v27
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 24, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    v_or_b32_e32 v38, v19, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    v_or_b32_e32 v30, v19, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v19, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v28, v19, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v48, v19, v11
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v42
+; GCN-NEXT:    v_or_b32_e32 v48, v19, v15
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 24, v54
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    v_or_b32_e32 v49, v19, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    v_or_b32_e32 v49, v19, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v50, v19, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v19, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v51, v19, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v53
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v55
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v40
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v41
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v43
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v45
-; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v15
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v13
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v21
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v23
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v51, v19, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v11
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v21
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v55
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v41
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v43
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v45
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v33
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v13
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr21
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
@@ -73884,7 +73991,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
@@ -73893,7 +74000,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
@@ -73902,176 +74009,172 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr11
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
+; GCN-NEXT:    ; kill: killed $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; kill: killed $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; kill: killed $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; kill: killed $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; kill: killed $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr23
-; GCN-NEXT:    ; kill: killed $vgpr23
-; GCN-NEXT:    ; implicit-def: $vgpr23
-; GCN-NEXT:    ; kill: killed $vgpr23
-; GCN-NEXT:    ; implicit-def: $vgpr23
-; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; kill: killed $vgpr15
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr15
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr20
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr27
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; kill: killed $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:  .LBB44_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB44_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v15
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v23, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v19, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v54
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v3, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v27
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v5
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v21, v3
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 8, v19
-; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v6, v4
+; GCN-NEXT:    v_or_b32_e32 v3, v13, v3
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 8, v1
+; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v6, v5
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v6, v13, v6
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v6, v23, v6
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v8, v7
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v56
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v53
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_or_b32_e32 v5, v5, v8
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v27, v8
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 8, v1
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v9, v8
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v9, v52, v9
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v1
+; GCN-NEXT:    v_or_b32_e32 v9, v10, v9
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v38
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
+; GCN-NEXT:    v_or_b32_e32 v10, v57, v10
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v11
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:844 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 8, v1
-; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
-; GCN-NEXT:    v_or_b32_e32 v10, v11, v10
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v35
+; GCN-NEXT:    v_lshlrev_b32_e32 v12, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
-; GCN-NEXT:    v_or_b32_e32 v11, v22, v11
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v11, v12, v11
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v4
+; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v20, v4
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
@@ -74080,23 +74183,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v13, v1, v13
+; GCN-NEXT:    v_or_b32_e32 v13, v33, v13
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
-; GCN-NEXT:    v_mov_b32_e32 v2, v15
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v1, v15
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
@@ -74107,17 +74207,17 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_or_b32_e32 v16, v17, v16
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v1, v17
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
@@ -74126,48 +74226,46 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v1, v19
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v21, 8, v21
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
 ; GCN-NEXT:    v_or_b32_e32 v20, v21, v20
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v1, v21
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_or_b32_e32 v22, v23, v22
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v28, v25, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
@@ -74182,11 +74280,13 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v37, v25, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    v_or_b32_e32 v40, v2, v24
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v40, v1, v24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
@@ -74195,11 +74295,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v45, v25, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v57, v1, v24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
@@ -74210,11 +74310,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 8, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_or_b32_e32 v58, v25, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v59, v1, v24
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
@@ -74230,26 +74330,26 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v61, v1, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v62, v1, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v63, v1, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
@@ -74257,7 +74357,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
@@ -74265,7 +74365,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v39, v2, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
@@ -74273,44 +74373,44 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v48, v2, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v49, v2, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v51, v2, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v52, v2, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v2, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v55, v2, v24
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
@@ -74318,7 +74418,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v2, v24
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
@@ -74326,10 +74426,10 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v2, v25
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
@@ -74337,12 +74437,12 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v2, v26
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v2, v27
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
@@ -74382,7 +74482,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v2, v34
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
@@ -74390,7 +74490,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v35, v2, v35
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
@@ -74398,7 +74498,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v36, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v36, v2, v36
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
@@ -74406,7 +74506,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v38, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
 ; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v38, v2, v38
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
@@ -74414,7 +74514,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v50, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xff, v50
 ; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v50
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v50, v2, v50
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
@@ -74422,7 +74522,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v54, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v54, 0xff, v54
 ; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v54
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v54, v2, v54
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
@@ -74481,13 +74581,13 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v56, v2, v56
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x300, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
 ; GCN-NEXT:    v_or_b32_e32 v61, v61, v2
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
@@ -74495,17 +74595,17 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v63, v3
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v1, v4
+; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v1, v5
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v6
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, s7, v7
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, s7, v8
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, s7, v9
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, s7, v10
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, s7, v11
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, s7, v12
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, s7, v13
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, s7, v14
@@ -74529,11 +74629,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v60
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
-; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
+; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
@@ -74557,11 +74657,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xffff, v59
 ; GCN-NEXT:    v_or_b32_e32 v1, v39, v1
 ; GCN-NEXT:    v_or_b32_e32 v6, v48, v6
-; GCN-NEXT:    v_or_b32_e32 v5, v49, v5
-; GCN-NEXT:    v_or_b32_e32 v7, v51, v7
-; GCN-NEXT:    v_or_b32_e32 v8, v52, v8
-; GCN-NEXT:    v_or_b32_e32 v9, v53, v9
-; GCN-NEXT:    v_or_b32_e32 v10, v55, v10
+; GCN-NEXT:    v_or_b32_e32 v7, v49, v7
+; GCN-NEXT:    v_or_b32_e32 v8, v51, v8
+; GCN-NEXT:    v_or_b32_e32 v9, v52, v9
+; GCN-NEXT:    v_or_b32_e32 v10, v53, v10
+; GCN-NEXT:    v_or_b32_e32 v4, v55, v4
 ; GCN-NEXT:    v_or_b32_e32 v11, v24, v11
 ; GCN-NEXT:    v_or_b32_e32 v12, v25, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v26, v13
@@ -74583,27 +74683,27 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v29, v46, v57
 ; GCN-NEXT:    v_or_b32_e32 v30, v47, v58
 ; GCN-NEXT:    v_or_b32_e32 v31, v56, v59
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, s6, v61
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, s6, v2
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, s6, v3
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, s6, v4
-; GCN-NEXT:    v_add_i32_e32 v39, vcc, s6, v1
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, s6, v6
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, s6, v61
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, s6, v2
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, s6, v3
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, s6, v5
+; GCN-NEXT:    v_add_i32_e32 v48, vcc, s6, v1
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, s6, v6
 ; GCN-NEXT:    v_add_i32_e32 v51, vcc, s6, v7
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, s6, v8
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, s6, v9
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, s6, v10
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, s6, v11
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v12
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, s6, v8
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, s6, v9
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, s6, v10
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, s6, v4
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, s6, v11
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, s6, v12
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, s6, v13
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, s6, v14
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v15
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v16
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, s6, v17
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, s6, v18
-; GCN-NEXT:    v_add_i32_e32 v3, vcc, s6, v19
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, s6, v20
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, s6, v17
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v18
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, s6, v19
+; GCN-NEXT:    v_add_i32_e32 v3, vcc, s6, v20
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, s6, v21
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, s6, v22
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s6, v23
@@ -74616,60 +74716,60 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, s6, v30
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, s6, v31
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v23
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v22
-; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v22
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v21
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v20
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v20
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v19
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v18
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v18
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v17
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v16
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v16
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v15
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v14
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v14
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v13
 ; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v7
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v3
 ; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v7
-; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v3
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v6
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v6
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v4
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v3
+; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v7
+; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v12
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v6
+; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v1
@@ -74677,163 +74777,164 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v9
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v12
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v12
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v11
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v11
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v11
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v8
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v8
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v37
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v37
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v52
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v52
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v51
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v51
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v5
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v48
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v39
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v39
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v38
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v50
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v50
-; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v35
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v8
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v4
+; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v52
+; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v39
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v39
+; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v51
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v51
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v50
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v48
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v5
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v49
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v49
+; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v38
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v38
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v33
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v33
 ; GCN-NEXT:  .LBB44_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v4, v4, v5, 16
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
+; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v63
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_alignbit_b32 v5, v5, v11, 16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
+; GCN-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v63
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    v_alignbit_b32 v11, v11, v13, 16
-; GCN-NEXT:    buffer_store_dword v5, v0, s[0:3], 0 offen
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, 4, v0
+; GCN-NEXT:    buffer_store_dword v5, v4, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, 4, v0
-; GCN-NEXT:    buffer_store_dword v11, v5, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v63, v4, v5, 16
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_alignbit_b32 v63, v5, v11, 16
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v62
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v57, v4, v5, 16
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 8, v0
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v62
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_alignbit_b32 v56, v5, v11, 16
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 8, v0
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v62, v4, v5, 16
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 12, v0
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_alignbit_b32 v62, v5, v11, 16
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, 12, v0
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v61
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v61, v4, v5, 16
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 16, v0
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v61
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_alignbit_b32 v61, v5, v11, 16
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 16, v0
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_alignbit_b32 v5, v5, v11, 16
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, 20, v0
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v4, v4, v5, 16
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v5, vcc, 20, v0
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v13
-; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v60
-; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    v_alignbit_b32 v60, v13, v16, 16
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v60
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v60, v4, v13, 16
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 24, v0
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v16
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v19, 1.0, v19
-; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_alignbit_b32 v16, v16, v19, 16
+; GCN-NEXT:    v_mul_f32_e32 v15, 1.0, v15
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v15, v4, v15, 16
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 28, v0
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v22
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
 ; GCN-NEXT:    v_mul_f32_e32 v23, 1.0, v59
-; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    v_alignbit_b32 v59, v22, v23, 16
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 32, v0
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v59, v4, v23, 16
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 32, v0
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v23, 1.0, v23
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v27, 1.0, v27
-; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_alignbit_b32 v23, v23, v27, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v4, v4, v27, 16
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 36, v0
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v31, 1.0, v31
+; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v58
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v31
+; GCN-NEXT:    v_alignbit_b32 v58, v31, v33, 16
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 40, v0
+; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
 ; GCN-NEXT:    v_mul_f32_e32 v29, 1.0, v29
-; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v58
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v29
-; GCN-NEXT:    v_alignbit_b32 v58, v29, v35, 16
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 40, v0
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_alignbit_b32 v3, v3, v29, 16
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 44, v0
 ; GCN-NEXT:    v_mul_f32_e32 v7, 1.0, v7
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
+; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v56
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_alignbit_b32 v3, v7, v3, 16
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, 44, v0
-; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v32
-; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v57
-; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    v_alignbit_b32 v57, v32, v35, 16
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 48, v0
+; GCN-NEXT:    v_alignbit_b32 v7, v7, v33, 16
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 48, v0
 ; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v34
-; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v33
+; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v32
 ; GCN-NEXT:    v_lshrrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    v_alignbit_b32 v33, v34, v33, 16
+; GCN-NEXT:    v_alignbit_b32 v32, v34, v32, 16
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 52, v0
-; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
-; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v47
-; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    v_alignbit_b32 v4, v4, v35, 16
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, 56, v0
+; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
+; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v47
+; GCN-NEXT:    v_lshrrev_b32_e32 v35, 16, v35
+; GCN-NEXT:    v_alignbit_b32 v35, v35, v38, 16
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, 56, v0
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v6, 1.0, v6
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_alignbit_b32 v2, v2, v6, 16
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 60, v0
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v46
+; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v46
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v1, v1, v52, 16
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, 64, v0
+; GCN-NEXT:    v_alignbit_b32 v1, v1, v53, 16
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, 64, v0
 ; GCN-NEXT:    v_mul_f32_e32 v9, 1.0, v9
 ; GCN-NEXT:    v_mul_f32_e32 v10, 1.0, v10
 ; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
@@ -74844,51 +74945,51 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_alignbit_b32 v18, v18, v54, 16
 ; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x48, v0
-; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
-; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v17
-; GCN-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    v_alignbit_b32 v17, v36, v17, 16
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x4c, v0
 ; GCN-NEXT:    v_mul_f32_e32 v26, 1.0, v26
-; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v44
+; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v17
 ; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    v_alignbit_b32 v26, v26, v42, 16
+; GCN-NEXT:    v_alignbit_b32 v17, v26, v17, 16
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 0x4c, v0
+; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
+; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v44
+; GCN-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
+; GCN-NEXT:    v_alignbit_b32 v36, v36, v42, 16
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x50, v0
 ; GCN-NEXT:    v_mul_f32_e32 v8, 1.0, v8
 ; GCN-NEXT:    v_mul_f32_e32 v25, 1.0, v25
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
 ; GCN-NEXT:    v_alignbit_b32 v8, v8, v25, 16
 ; GCN-NEXT:    v_add_i32_e32 v25, vcc, 0x54, v0
-; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
+; GCN-NEXT:    v_mul_f32_e32 v12, 1.0, v12
 ; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v43
-; GCN-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    v_alignbit_b32 v37, v37, v43, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
+; GCN-NEXT:    v_alignbit_b32 v12, v12, v43, 16
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x58, v0
 ; GCN-NEXT:    v_mul_f32_e32 v14, 1.0, v14
-; GCN-NEXT:    v_mul_f32_e32 v12, 1.0, v12
+; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    v_alignbit_b32 v12, v14, v12, 16
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, 0x5c, v0
-; GCN-NEXT:    v_mul_f32_e32 v20, 1.0, v20
+; GCN-NEXT:    v_alignbit_b32 v14, v14, v37, 16
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 0x5c, v0
+; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
 ; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v41
-; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    v_alignbit_b32 v20, v20, v41, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v39
+; GCN-NEXT:    v_alignbit_b32 v39, v39, v41, 16
 ; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x60, v0
-; GCN-NEXT:    v_mul_f32_e32 v28, 1.0, v28
 ; GCN-NEXT:    v_mul_f32_e32 v24, 1.0, v24
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    v_alignbit_b32 v24, v28, v24, 16
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 0x64, v0
-; GCN-NEXT:    v_mul_f32_e32 v30, 1.0, v30
+; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v22
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v24
+; GCN-NEXT:    v_alignbit_b32 v22, v24, v22, 16
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 0x64, v0
+; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v16
 ; GCN-NEXT:    v_mul_f32_e32 v40, 1.0, v40
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v30
-; GCN-NEXT:    v_alignbit_b32 v30, v30, v40, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
+; GCN-NEXT:    v_alignbit_b32 v16, v16, v40, 16
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x68, v0
-; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
-; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v38
-; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    v_alignbit_b32 v38, v39, v38, 16
-; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x6c, v0
+; GCN-NEXT:    v_mul_f32_e32 v28, 1.0, v28
+; GCN-NEXT:    v_mul_f32_e32 v30, 1.0, v30
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v28
+; GCN-NEXT:    v_alignbit_b32 v28, v28, v30, 16
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 0x6c, v0
 ; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v48
 ; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v55
 ; GCN-NEXT:    v_lshrrev_b32_e32 v48, 16, v48
@@ -74900,40 +75001,42 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v49, v50, v49, 16
 ; GCN-NEXT:    v_add_i32_e32 v50, vcc, 0x74, v0
 ; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v51
-; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v53
+; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v52
 ; GCN-NEXT:    v_lshrrev_b32_e32 v51, 16, v51
-; GCN-NEXT:    v_alignbit_b32 v51, v51, v53, 16
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x78, v0
+; GCN-NEXT:    v_alignbit_b32 v51, v51, v52, 16
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
-; GCN-NEXT:    buffer_store_dword v63, v31, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v56, v15, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v62, v21, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v61, v11, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v63, v11, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v57, v21, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v62, v20, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v61, v5, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v5, v13, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v60, v19, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v16, v22, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v15, v23, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v59, v27, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v23, v29, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v58, v7, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v3, v32, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v57, v34, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v33, v35, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v4, v6, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v2, v52, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v4, v31, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v58, v29, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v33, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v7, v34, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v32, v38, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v35, v6, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v2, v53, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v1, v10, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v9, v54, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v18, v36, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v18, v26, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v17, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v26, v25, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v36, v25, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v8, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v37, v14, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v12, v41, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v20, v28, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v24, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v30, v39, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v38, v55, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v12, v37, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v14, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v39, v24, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v40, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v16, v30, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v55, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v48, v50, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v49, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v49, v52, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v51, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
@@ -75006,39 +75109,39 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:136
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:152
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:160
 ; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:168
 ; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:176
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v25
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v29
-; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v3
 ; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v5
 ; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v7
-; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
-; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v13
-; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v15
+; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v15
 ; VI-NEXT:    v_lshlrev_b16_e32 v35, 8, v17
 ; VI-NEXT:    v_lshlrev_b16_e32 v36, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v21
 ; VI-NEXT:    v_lshlrev_b16_e32 v34, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v27
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
@@ -75052,46 +75155,47 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v27
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
 ; VI-NEXT:    ; implicit-def: $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v37
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v38
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v39
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v48
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v49
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v50
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v51
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
@@ -75108,7 +75212,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v43
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:208
@@ -75116,15 +75220,15 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
@@ -75147,20 +75251,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -75173,20 +75277,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -75194,17 +75298,17 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
@@ -75220,17 +75324,17 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:316
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
@@ -75245,44 +75349,44 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v3, off, s[0:3], s32 offset:376
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:348
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v1
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v1
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v2
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v3
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:372
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
 ; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -75308,31 +75412,31 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v49 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v4, v4, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr55
 ; VI-NEXT:    ; implicit-def: $vgpr40
 ; VI-NEXT:    ; implicit-def: $vgpr41
-; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr36
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v34 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr34
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v52 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr52
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v37 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr37
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -75351,35 +75455,35 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_or_b32_sdwa v10, v63, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v61, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_or_b32_sdwa v12, v58, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v59, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_or_b32_sdwa v14, v45, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    ; implicit-def: $vgpr45
 ; VI-NEXT:    s_waitcnt vmcnt(2)
@@ -75388,26 +75492,26 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v63, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr62
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v60, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v10, v57, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v56, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr56
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v58, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr56
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -75416,35 +75520,35 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v44, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -75459,20 +75563,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -75480,189 +75584,189 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v31, v31, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    v_or_b32_sdwa v31, v31, v49 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr49
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v32, v32, v53 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v30, v30, v37 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v30, v30, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v30, v30, v38 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v30, v30, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v31, v31, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v31, v31, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    ; kill: killed $vgpr32
@@ -75767,7 +75871,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr51
 ; VI-NEXT:  .LBB44_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB44_4
@@ -75775,53 +75879,51 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v18, 0x300
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
 ; VI-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    v_add_u16_sdwa v4, v0, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(12)
 ; VI-NEXT:    v_or_b32_sdwa v29, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v4, v0, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_sdwa v0, v2, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v29, 0x300, v29
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v2, v0
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v2, 3, v2
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v52, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v51, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v50, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v2, v2, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v2, v3, v2
@@ -75831,19 +75933,19 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v1, 0x300, v1
 ; VI-NEXT:    v_or_b32_e32 v1, v1, v4
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v3, v49, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v48, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v3, v3, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
-; VI-NEXT:    v_or_b32_sdwa v4, v39, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v4, v37, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v4, 0x300, v4
 ; VI-NEXT:    v_or_b32_e32 v3, v4, v3
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
@@ -75851,14 +75953,13 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v33, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v6, 0x300, v6
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v35, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v5, 0x300, v5
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v36, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
@@ -75875,7 +75976,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v6, v32, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v32, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
@@ -75883,7 +75984,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v32
 ; VI-NEXT:    s_waitcnt vmcnt(1)
@@ -75891,78 +75992,94 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v33, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v33, v33, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v27, 0x300, v27
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v33
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v34, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v34, v34, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v34
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v35, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v35, v35, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v25, 0x300, v25
+; VI-NEXT:    v_or_b32_e32 v25, v25, v35
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v36, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v6, v7, v6
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v25, 0x300, v25
-; VI-NEXT:    v_or_b32_e32 v25, v25, v35
+; VI-NEXT:    v_add_u16_sdwa v36, v36, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v36, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_sdwa v36, v36, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v24, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
+; VI-NEXT:    v_or_b32_e32 v24, v24, v36
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v37, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v37, v37, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    v_or_b32_sdwa v24, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v23, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
-; VI-NEXT:    v_or_b32_e32 v24, v24, v36
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
+; VI-NEXT:    v_or_b32_e32 v23, v23, v37
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
 ; VI-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v7, v8, v7
-; VI-NEXT:    v_add_u16_e32 v8, 3, v61
+; VI-NEXT:    v_add_u16_e32 v8, 3, v63
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 3, v62
@@ -75971,30 +76088,30 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v8, v9, v8
-; VI-NEXT:    v_add_u16_e32 v9, 3, v63
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v59
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v9, v10, v9
-; VI-NEXT:    v_add_u16_e32 v10, 3, v60
+; VI-NEXT:    v_add_u16_e32 v10, 3, v57
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v10, v11, v10
-; VI-NEXT:    v_add_u16_e32 v11, 3, v58
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v58
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v11, v12, v11
@@ -76003,7 +76120,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v12, v13, v12
@@ -76012,7 +76129,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v13, v14, v13
@@ -76024,35 +76141,35 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    v_or_b32_e32 v14, v15, v14
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v19, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -76061,54 +76178,43 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v16, v19, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_e32 v16, v19, v16
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
-; VI-NEXT:    v_or_b32_sdwa v30, v38, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v30, v39, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v30, 0x300, v30
-; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
-; VI-NEXT:    v_or_b32_sdwa v31, v50, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v31, v51, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v31, 0x300, v31
-; VI-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
-; VI-NEXT:    v_or_b32_sdwa v21, v37, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v37, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v21, v38, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v40, v21, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v37, v37, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v40
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v23, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
-; VI-NEXT:    v_or_b32_e32 v23, v23, v37
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v38, v38, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v20, 3, v20
-; VI-NEXT:    v_or_b32_sdwa v20, v48, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_sdwa v55, v20, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v38, v38, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_e32 v30, v30, v55
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v39, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v20, 3, v20
+; VI-NEXT:    v_or_b32_sdwa v20, v49, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v55, v20, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v38
-; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_e32 v30, v30, v55
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v39, 3, v39
 ; VI-NEXT:    v_or_b32_sdwa v39, v48, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
@@ -76130,7 +76236,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v53, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v39, 3, v39
 ; VI-NEXT:    v_or_b32_sdwa v39, v49, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -76232,17 +76338,16 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
 ; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:136
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:136
 ; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:144
 ; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:152
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:160
 ; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:168
 ; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:176
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v27
-; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v29
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v3
@@ -76250,81 +76355,81 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v11
-; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v13
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v15
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v17
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v21
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v23
-; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v25
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v25
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    ; implicit-def: $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v37
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v39
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v48
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v49
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
@@ -76332,7 +76437,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
@@ -76349,7 +76454,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v43
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -76358,15 +76463,15 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
@@ -76390,20 +76495,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -76417,20 +76522,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -76439,17 +76544,17 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
@@ -76476,7 +76581,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
@@ -76496,48 +76601,48 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v1
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v3
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:372
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB44_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -76546,9 +76651,9 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s6
@@ -76577,10 +76682,10 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; implicit-def: $vgpr51
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v3, v4, v3, s6
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr48
+; GFX9-NEXT:    ; implicit-def: $vgpr39
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v36 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
@@ -76591,7 +76696,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v5, v6, v5, s6
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr34
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -76606,93 +76711,93 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v7, v8, v7, s6
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v58, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v15, v43, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr43
+; GFX9-NEXT:    v_or_b32_sdwa v15, v42, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v62, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
+; GFX9-NEXT:    ; implicit-def: $vgpr61
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v57, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v43, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr42
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v16, v17, v16, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -76707,20 +76812,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v18, v19, v18, s6
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v19, v20, v19, s6
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v20, v21, v20, s6
@@ -76728,58 +76833,58 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v21, v22, v21, s6
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v22, v23, v22, s6
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v23, v24, v23, s6
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v24, v25, v24, s6
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v25, v26, v25, s6
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v26, v27, v26, s6
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
@@ -76797,22 +76902,22 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v28, v29, v28, s6
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr53
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v37 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v29, v30, v29, s6
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr37
 ; GFX9-NEXT:    ; kill: killed $vgpr37
 ; GFX9-NEXT:    ; implicit-def: $vgpr37
@@ -76909,7 +77014,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v30, v31, v30, s6
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr49
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -77021,27 +77126,27 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_cbranch_execz .LBB44_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
@@ -77049,19 +77154,20 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
+; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v2
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v3
-; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_perm_b32 v0, v2, v0, s6
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(17)
+; GFX9-NEXT:    s_waitcnt vmcnt(16)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
@@ -77086,7 +77192,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v3, v48, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v39, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
@@ -77112,7 +77218,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_or_b32_sdwa v36, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
@@ -77136,11 +77242,11 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v37, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v37, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
-; GFX9-NEXT:    v_or_b32_sdwa v21, v39, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v21, v48, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v33, 0x300, v21
 ; GFX9-NEXT:    v_add_u16_e32 v34, 0x300, v23
 ; GFX9-NEXT:    v_perm_b32 v29, v34, v29, s6
@@ -77148,16 +77254,17 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v32, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v6, 0x300, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_perm_b32 v6, v7, v6, s6
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v38, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
@@ -77166,17 +77273,18 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v39, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v39
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v48, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -77189,45 +77297,45 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v57
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v58
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
@@ -77236,63 +77344,63 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v43
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v43
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v42
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v16
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v18
 ; GFX9-NEXT:    v_perm_b32 v17, v17, v20, s6
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v19
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v16, v18, v16, s6
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v49, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v49, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v20
 ; GFX9-NEXT:    v_perm_b32 v30, v33, v30, s6
@@ -77300,7 +77408,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v50, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v52, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -77310,14 +77418,14 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v51, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v51
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v52, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v53, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -77327,7 +77435,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v53, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v53
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -77340,7 +77448,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v55, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v55
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -77348,7 +77456,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v40, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v50, 0x300, v40
 ; GFX9-NEXT:    v_perm_b32 v21, v50, v21, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -77356,14 +77464,14 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v41, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v41
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v42, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v51, 0x300, v42
 ; GFX9-NEXT:    v_perm_b32 v20, v51, v20, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -77371,7 +77479,7 @@ define <64 x bfloat> @bitcast_v128i8_to_v64bf16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v43, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v43
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -79150,646 +79258,635 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:124
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v41
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:28
-; GCN-NEXT:    s_waitcnt expcnt(6)
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:24
-; GCN-NEXT:    s_waitcnt expcnt(5)
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:20
-; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v9
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v18
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v17
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v20, 1.0, v20
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v20
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v22
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v21
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v24
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v26
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v25
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v7, 1.0, v28
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v30
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v34
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v40
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v55
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v32
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_mul_f32_e32 v9, 1.0, v58
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v59
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v56
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v57
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v46
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v47
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v41
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:136
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:124
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_mul_f32_e32 v29, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v43
+; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v42
+; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v41
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v43
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v54
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v40
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v55
+; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v54
+; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v32
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v53
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v18, 1.0, v51
-; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v50
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v18, 1.0, v52
+; GCN-NEXT:    v_mul_f32_e32 v19, 1.0, v51
+; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v50
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v49
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v48
-; GCN-NEXT:    v_mul_f32_e32 v19, 1.0, v39
-; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v38
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v21, 1.0, v48
+; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v39
+; GCN-NEXT:    v_mul_f32_e32 v59, 1.0, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v37
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v21, 1.0, v36
-; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v35
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v27, 1.0, v36
+; GCN-NEXT:    v_mul_f32_e32 v31, 1.0, v31
+; GCN-NEXT:    v_mul_f32_e32 v9, 1.0, v35
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:104
-; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v33
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v31
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v34
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:96
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_mul_f32_e32 v23, 1.0, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v31, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v5
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_mul_f32_e32 v27, 1.0, v12
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v42
+; GCN-NEXT:    v_mul_f32_e32 v8, 1.0, v8
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v6, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v7
+; GCN-NEXT:    v_mul_f32_e32 v7, 1.0, v10
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v11
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:120
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v44
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:120
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:128
-; GCN-NEXT:    v_mul_f32_e32 v29, 1.0, v45
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v12
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v12
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_mul_f32_e32 v23, 1.0, v13
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v11
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v13
 ; GCN-NEXT:    ; implicit-def: $vgpr28
-; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr20
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr12
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr37
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr38
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr61
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; kill: killed $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr12
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB45_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v28, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v24, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v25, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v26, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v28, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v30, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v24, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v52, 16, v35
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v25, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v40, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v39
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v26, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v42, v2, v3, 16
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v30, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v36, 16, v48
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v35
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v44, v2, v3, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v37, 16, v60
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v40, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v48
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v46, v2, v3, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v38, 16, v20
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v42, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v56, v2, v3, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v22
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v36, 16, v49
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v58, v2, v3, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v7
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v32
-; GCN-NEXT:    v_alignbit_b32 v61, v2, v18, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v34
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v63
-; GCN-NEXT:    v_alignbit_b32 v15, v2, v19, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v9
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v21
-; GCN-NEXT:    v_alignbit_b32 v14, v2, v17, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v11
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v31
-; GCN-NEXT:    v_alignbit_b32 v13, v2, v27, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v43
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v29
-; GCN-NEXT:    v_alignbit_b32 v59, v2, v33, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v47
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 24, v35
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 24, v39
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 24, v48
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 24, v60
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v44, v10, v12, 16
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v37, 16, v50
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v62
+; GCN-NEXT:    v_alignbit_b32 v46, v10, v63, 16
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v38, 16, v51
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v16
+; GCN-NEXT:    v_alignbit_b32 v56, v10, v17, 16
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v52
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v18
+; GCN-NEXT:    v_alignbit_b32 v58, v10, v19, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v21
+; GCN-NEXT:    v_alignbit_b32 v61, v10, v22, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v29
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v27
+; GCN-NEXT:    v_alignbit_b32 v15, v10, v31, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v23
+; GCN-NEXT:    v_alignbit_b32 v14, v10, v34, 16
+; GCN-NEXT:    v_mov_b32_e32 v10, v12
+; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v32
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
+; GCN-NEXT:    v_alignbit_b32 v13, v6, v47, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v43
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_alignbit_b32 v33, v1, v5, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v59
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v35
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 24, v20
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v48
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 24, v22
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v7
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v49
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v34
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v50
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v9
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v51
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v11
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v52
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v43
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 24, v29
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 24, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 24, v32
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 24, v43
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v47
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 24, v59
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v62
-; GCN-NEXT:    v_lshrrev_b32_e32 v9, 24, v62
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v9
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 24, v9
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v16
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 24, v16
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v8
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 24, v8
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v1
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 24, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v7
+; GCN-NEXT:    v_lshrrev_b32_e32 v7, 24, v7
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v23
-; GCN-NEXT:    v_lshrrev_b32_e32 v16, 24, v23
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v60
+; GCN-NEXT:    v_lshrrev_b32_e32 v8, 24, v60
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v51, v52, v16, 16
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v51, v39, v8, 16
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v35, v12, v16, 16
+; GCN-NEXT:    v_alignbit_b32 v35, v20, v8, 16
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v20, v39
 ; GCN-NEXT:    buffer_store_dword v36, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v36, v36, v12, 16
+; GCN-NEXT:    v_alignbit_b32 v36, v36, v8, 16
 ; GCN-NEXT:    buffer_store_dword v37, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v37, v37, v12, 16
+; GCN-NEXT:    v_alignbit_b32 v37, v37, v8, 16
 ; GCN-NEXT:    buffer_store_dword v38, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v38, v38, v12, 16
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v38, v38, v8, 16
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v39, v6, v12, 16
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v48, v3, v12, 16
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v39, v10, v8, 16
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v49, v4, v3, 16
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v48, v16, v8, 16
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v50, v5, v3, 16
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v5, v52
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v49, v17, v8, 16
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v52, v8, v3, 16
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v50, v18, v8, 16
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v53, v10, v3, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v52, v12, v8, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v54, v2, v3, 16
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v53, v6, v8, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v55, v7, v2, 16
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v41, v9, v2, 16
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v54, v1, v6, 16
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v45, v11, v2, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v55, v3, v1, 16
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v57, v1, v2, 16
-; GCN-NEXT:    v_alignbit_b32 v4, v51, v28, 24
-; GCN-NEXT:    v_alignbit_b32 v10, v51, v28, 16
-; GCN-NEXT:    v_alignbit_b32 v3, v51, v28, 8
+; GCN-NEXT:    v_alignbit_b32 v41, v4, v1, 16
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v45, v5, v2, 16
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v57, v7, v11, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v28, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v28, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v28, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v35, v24, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v35, v24, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v8, v35, v24, 8
+; GCN-NEXT:    v_alignbit_b32 v12, v35, v24, 8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v36, v25, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v36, v25, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v12, v36, v25, 8
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v36, v25, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v37, v26, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
@@ -79798,7 +79895,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v37, v26, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v38, v30, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
@@ -79807,7 +79904,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v38, v30, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v39, v40, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
@@ -79816,7 +79913,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v39, v40, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v48, v42, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
@@ -79825,7 +79922,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v48, v42, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v49, v44, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
@@ -79834,7 +79931,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v49, v44, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v50, v46, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
@@ -79843,7 +79940,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v50, v46, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v52, v56, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
@@ -79852,7 +79949,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v52, v56, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v53, v58, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
@@ -79861,7 +79958,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v53, v58, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v54, v61, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
@@ -79870,7 +79967,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v54, v61, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v55, v15, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
@@ -79879,7 +79976,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v55, v15, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v41, v14, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
@@ -79888,7 +79985,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v41, v14, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v45, v13, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
@@ -79897,74 +79994,64 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v45, v13, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v57, v59, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v57, v33, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v57, v59, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v57, v33, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v57, v59, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v57, v33, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v51
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v35
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v36
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v37
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v38
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v39
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v48
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v49
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v50
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v52
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v53
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v54
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v55
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v41
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v45
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v57
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
@@ -79983,310 +80070,304 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr20
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr22
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr9
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr4
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr29
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr63
+; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr17
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr18
-; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr63
+; GCN-NEXT:    ; implicit-def: $vgpr18
 ; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr22
+; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr27
+; GCN-NEXT:    ; implicit-def: $vgpr31
+; GCN-NEXT:    ; implicit-def: $vgpr9
 ; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr6
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr5
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:  .LBB45_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB45_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v33
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v29
-; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
-; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
-; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    v_alignbit_b32 v59, v13, v12, 16
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v27
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v31
-; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
-; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
-; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    v_alignbit_b32 v13, v14, v13, 16
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v17
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v21
-; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
-; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    v_alignbit_b32 v14, v15, v14, 16
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v19
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v63
-; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
-; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    v_alignbit_b32 v15, v17, v15, 16
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v23
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
-; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v16
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v6
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v62
-; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v18
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v32
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_alignbit_b32 v33, v1, v5, 16
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v47
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v6
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_alignbit_b32 v13, v5, v1, 16
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v34
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v23
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_alignbit_b32 v14, v5, v1, 16
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v31
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v27
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_alignbit_b32 v15, v5, v1, 16
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v11
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v60
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v2
+; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v12
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v47
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v2
+; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v9
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v22
+; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v21
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v9
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v59
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v18
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v12
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v9
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v43
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v9
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v32
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v62
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v12
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v9
+; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v11
-; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v3
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v33, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v9
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v34
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v3
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v9
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v7
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v54, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v55, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v54, 0xffff0000, v7
-; GCN-NEXT:    v_and_b32_e32 v55, 0xffff0000, v20
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v40, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v41, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v41, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v42, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v42, 0xffff0000, v7
-; GCN-NEXT:    v_and_b32_e32 v43, 0xffff0000, v60
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v43, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v44, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v44, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v45, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v45, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v46, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v46, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v47, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v47, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v56, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v56, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v57, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v58, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v58, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v59, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v60, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v61, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v61, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v62, 0xffff0000, v7
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v62, 0xffff0000, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v7
-; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v3
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v4
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v60, 0x40c00000, v1
-; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v3
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v6
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v7
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v27
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v6
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v1
-; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v17
-; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v18
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v19
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v2
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v20
+; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v10
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v21
-; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v16
-; GCN-NEXT:    v_add_f32_e32 v3, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v12
+; GCN-NEXT:    v_add_f32_e32 v3, 0x40c00000, v11
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v24
+; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v3, 0x40c00000, v25
-; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v26
-; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v27
+; GCN-NEXT:    v_add_f32_e32 v3, 0x40c00000, v19
+; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v22
+; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v23
 ; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v27, 0x40c00000, v28
+; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v29
-; GCN-NEXT:    v_add_f32_e32 v29, 0x40c00000, v30
-; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v31
+; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v17
+; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v16
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v25
 ; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_f32_e32 v32, 0x40c00000, v32
+; GCN-NEXT:    v_add_f32_e32 v27, 0x40c00000, v26
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v33
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v28
+; GCN-NEXT:    v_add_f32_e32 v28, 0x40c00000, v30
+; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v31
+; GCN-NEXT:    v_add_f32_e32 v32, 0x40c00000, v32
+; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v34
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x40c00000, v35
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v36
-; GCN-NEXT:    v_add_f32_e32 v34, 0x40c00000, v34
-; GCN-NEXT:    v_add_f32_e32 v33, 0x40c00000, v37
-; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v38
+; GCN-NEXT:    v_add_f32_e32 v34, 0x40c00000, v29
+; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v37
+; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v38
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v39
 ; GCN-NEXT:    v_add_f32_e32 v35, 0x40c00000, v48
-; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v49
-; GCN-NEXT:    v_add_f32_e32 v28, 0x40c00000, v50
-; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v51
-; GCN-NEXT:    v_add_f32_e32 v36, 0x40c00000, v22
+; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v49
+; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v50
+; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v51
+; GCN-NEXT:    v_add_f32_e32 v36, 0x40c00000, v9
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x40c00000, v52
 ; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v53
-; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v54
+; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v54
 ; GCN-NEXT:    v_add_f32_e32 v37, 0x40c00000, v55
 ; GCN-NEXT:    v_add_f32_e32 v25, 0x40c00000, v40
 ; GCN-NEXT:    v_add_f32_e32 v50, 0x40c00000, v41
@@ -80294,46 +80375,46 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v38, 0x40c00000, v43
 ; GCN-NEXT:    v_add_f32_e32 v41, 0x40c00000, v44
 ; GCN-NEXT:    v_add_f32_e32 v52, 0x40c00000, v45
-; GCN-NEXT:    v_add_f32_e32 v43, 0x40c00000, v46
+; GCN-NEXT:    v_add_f32_e32 v29, 0x40c00000, v46
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x40c00000, v47
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x40c00000, v56
 ; GCN-NEXT:    v_add_f32_e32 v54, 0x40c00000, v57
-; GCN-NEXT:    v_add_f32_e32 v47, 0x40c00000, v58
-; GCN-NEXT:    v_add_f32_e32 v48, 0x40c00000, v11
-; GCN-NEXT:    v_add_f32_e32 v53, 0x40c00000, v9
+; GCN-NEXT:    v_add_f32_e32 v43, 0x40c00000, v58
+; GCN-NEXT:    v_add_f32_e32 v48, 0x40c00000, v59
+; GCN-NEXT:    v_add_f32_e32 v53, 0x40c00000, v60
 ; GCN-NEXT:    v_add_f32_e32 v55, 0x40c00000, v61
-; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v62
+; GCN-NEXT:    v_add_f32_e32 v47, 0x40c00000, v62
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x40c00000, v63
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v61, v1, v6, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v62, 16, v18
+; GCN-NEXT:    v_alignbit_b32 v61, v1, v2, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v60, 16, v18
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v12
-; GCN-NEXT:    v_alignbit_b32 v58, v1, v2, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v21
+; GCN-NEXT:    v_alignbit_b32 v58, v1, v3, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v59, 16, v21
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v24
-; GCN-NEXT:    v_alignbit_b32 v56, v1, v3, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v27
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v29
-; GCN-NEXT:    v_alignbit_b32 v46, v1, v4, 16
+; GCN-NEXT:    v_alignbit_b32 v56, v1, v4, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v27
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v28
+; GCN-NEXT:    v_alignbit_b32 v46, v1, v5, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v32
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    v_alignbit_b32 v44, v1, v5, 16
+; GCN-NEXT:    v_alignbit_b32 v44, v1, v6, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v34
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v7
-; GCN-NEXT:    v_alignbit_b32 v42, v1, v33, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v10
+; GCN-NEXT:    v_alignbit_b32 v42, v1, v7, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v35
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v28
-; GCN-NEXT:    v_alignbit_b32 v40, v1, v8, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v36
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v8
+; GCN-NEXT:    v_alignbit_b32 v40, v1, v22, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v36
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v26
 ; GCN-NEXT:    v_alignbit_b32 v30, v1, v30, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v37
+; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v50
 ; GCN-NEXT:    v_alignbit_b32 v26, v1, v25, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v38
+; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v38
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v52
 ; GCN-NEXT:    v_alignbit_b32 v25, v1, v41, 16
-; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v39
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v39
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v54
 ; GCN-NEXT:    v_alignbit_b32 v24, v1, v51, 16
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v48
@@ -80376,91 +80457,98 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 24, v18
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v10
-; GCN-NEXT:    v_lshrrev_b32_e32 v10, 24, v10
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v20
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v20
 ; GCN-NEXT:    v_lshrrev_b32_e32 v18, 24, v20
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v60
-; GCN-NEXT:    v_lshrrev_b32_e32 v20, 24, v60
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v20
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 24, v20
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v21
 ; GCN-NEXT:    v_lshrrev_b32_e32 v21, 24, v21
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v22
+; GCN-NEXT:    v_lshrrev_b32_e32 v22, 24, v22
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v57, v20, v21, 16
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v57, v21, v22, 16
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v45, v18, v20, 16
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v45, v20, v21, 16
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v20, v31
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v41, v10, v18, 16
+; GCN-NEXT:    v_alignbit_b32 v41, v18, v21, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v55, v1, v10, 16
-; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v55, v1, v18, 16
+; GCN-NEXT:    buffer_store_dword v60, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v54, v62, v1, 16
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v53, v12, v1, 16
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v54, v60, v1, 16
+; GCN-NEXT:    buffer_store_dword v59, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v52, v11, v1, 16
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v53, v59, v1, 16
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v50, v4, v1, 16
+; GCN-NEXT:    v_alignbit_b32 v52, v12, v1, 16
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v50, v4, v16, 16
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_alignbit_b32 v49, v3, v17, 16
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_alignbit_b32 v48, v2, v19, 16
-; GCN-NEXT:    v_alignbit_b32 v39, v6, v16, 16
-; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v38, v29, v22, 16
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v37, v8, v23, 16
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v36, v7, v43, 16
+; GCN-NEXT:    v_alignbit_b32 v39, v10, v11, 16
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v38, v8, v9, 16
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v37, v7, v23, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v36, v6, v29, 16
 ; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v35, v5, v47, 16
+; GCN-NEXT:    v_alignbit_b32 v35, v5, v43, 16
+; GCN-NEXT:    v_alignbit_b32 v51, v20, v47, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v28, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v28, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v28, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v5, v31
-; GCN-NEXT:    v_alignbit_b32 v51, v5, v9, 16
-; GCN-NEXT:    v_alignbit_b32 v4, v51, v28, 24
-; GCN-NEXT:    v_alignbit_b32 v10, v51, v28, 16
-; GCN-NEXT:    v_alignbit_b32 v3, v51, v28, 8
 ; GCN-NEXT:    v_alignbit_b32 v1, v35, v24, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v35, v24, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v8, v35, v24, 8
+; GCN-NEXT:    v_alignbit_b32 v12, v35, v24, 8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v36, v25, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v36, v25, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v12, v36, v25, 8
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v36, v25, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v37, v26, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
@@ -80469,7 +80557,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v37, v26, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v38, v30, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
@@ -80478,7 +80566,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v38, v30, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v39, v40, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
@@ -80487,7 +80575,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v39, v40, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v48, v42, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
@@ -80496,7 +80584,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v48, v42, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v49, v44, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
@@ -80505,7 +80593,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v49, v44, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v50, v46, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
@@ -80514,7 +80602,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v50, v46, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v52, v56, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
@@ -80523,7 +80611,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v52, v56, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v53, v58, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
@@ -80532,7 +80620,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v53, v58, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v54, v61, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
@@ -80541,7 +80629,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v54, v61, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v55, v15, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
@@ -80550,7 +80638,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v55, v15, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v41, v14, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
@@ -80559,7 +80647,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v41, v14, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v45, v13, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
@@ -80568,80 +80656,86 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v45, v13, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v57, v59, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v57, v33, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v57, v59, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v57, v33, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v57, v59, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v57, v33, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v51
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v35
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v36
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v37
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v38
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v39
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v48
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v49
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v50
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v52
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v53
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v54
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v55
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v41
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v45
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v57
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GCN-NEXT:  .LBB45_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v3
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v51
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v10
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
-; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v5
+; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v20
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 24, v5
@@ -80657,156 +80751,158 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v2, v1, s[0:3], 0 offen
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v12
 ; GCN-NEXT:    v_or_b32_e32 v29, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v35
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; GCN-NEXT:    v_or_b32_e32 v31, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v12
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; GCN-NEXT:    v_or_b32_e32 v2, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v36
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v62, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v26
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v6, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v37
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v30
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v38
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v40
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v7, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v39
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v8, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v42
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v9, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v48
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v10, v1, v3
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v44
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v11, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v49
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v16, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v46
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v17, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v50
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v18, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v56
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v19, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v52
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v20, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v58
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v21, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v53
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v22, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v61
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v23, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v54
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v24, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v15, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v55
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v25, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v14
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v14, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v41
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v26, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v13
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v13, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v45
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v27, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v59
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v33
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v12, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v57
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v28, v1, v3
@@ -80882,7 +80978,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v39, v3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v6
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v10
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
@@ -81059,33 +81155,30 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v62
 ; GCN-NEXT:    v_or_b32_e32 v61, v3, v34
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 16, v0
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v35
 ; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v62, vcc, 20, v0
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v36
 ; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 24, v0
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_or_b32_e32 v29, v29, v37
-; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
+; GCN-NEXT:    v_or_b32_e32 v10, v10, v37
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v29, vcc, 28, v0
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v38
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
+; GCN-NEXT:    v_or_b32_e32 v10, v10, v38
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 32, v0
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v39
@@ -81098,7 +81191,9 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v9
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v49
 ; GCN-NEXT:    v_add_i32_e32 v33, vcc, 44, v0
-; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v10
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v50
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, 48, v0
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v11
@@ -81238,8 +81333,8 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    ; implicit-def: $vgpr35
 ; VI-NEXT:    ; implicit-def: $vgpr45
@@ -82417,13 +82512,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 8, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v4, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 12, v0
@@ -82515,13 +82611,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v14, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 52, v0
@@ -82541,13 +82638,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 56, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v16, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 60, v0
@@ -82567,13 +82665,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 64, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v18, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x44, v0
@@ -82593,13 +82692,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x48, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v20, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x4c, v0
@@ -82619,13 +82719,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x50, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x54, v0
@@ -82645,13 +82746,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x58, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x5c, v0
@@ -82668,13 +82770,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x60, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x64, v0
@@ -82694,13 +82797,14 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x68, v0
 ; VI-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 0x6c, v0
@@ -82778,8 +82882,8 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_mov_b32_e32 v46, v15
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    ; kill: killed $vgpr33
@@ -82964,7 +83068,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v58, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v59, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(34)
+; GFX9-NEXT:    s_waitcnt vmcnt(33)
 ; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v15
 ; GFX9-NEXT:    ; implicit-def: $vgpr15
 ; GFX9-NEXT:    ; kill: killed $vgpr15
@@ -82989,7 +83093,6 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v2
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(36)
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 24, v32
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
@@ -83418,7 +83521,6 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX9-NEXT:    v_cmp_u_f32_e32 vcc, v15, v15
 ; GFX9-NEXT:    v_cndmask_b32_e32 v15, v16, v17, vcc
 ; GFX9-NEXT:    v_perm_b32 v33, v15, v25, s7
-; GFX9-NEXT:    s_waitcnt vmcnt(52)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v15, 16, v32
 ; GFX9-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GFX9-NEXT:    v_bfe_u32 v16, v15, 16, 1
@@ -85016,24 +85118,22 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v3, 0xffff, v36, v3
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v36, 0xffff0000, v5
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e64 v33.l, v147.h
-; GFX11-TRUE16-NEXT:    v_dual_add_f32 v34, 0x40c00000, v38 :: v_dual_add_f32 v7, 0x40c00000, v7
+; GFX11-TRUE16-NEXT:    v_dual_add_f32 v34, 0x40c00000, v38 :: v_dual_lshlrev_b32 v5, 16, v5
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v105, 8, v3
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v4, 0xffff, v33, v149
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v37, v34, 16, 1
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v33, v6, 16, 1
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v37, v34, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v38, 0x400000, v34
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v93, 24, v4
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v37, v34, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v33, v33, v6, 0x7fff
-; GFX11-TRUE16-NEXT:    v_or_b32_e32 v37, 0x400000, v6
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v95, 8, v4
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT:    v_add3_u32 v35, v37, v34, 0x7fff
+; GFX11-TRUE16-NEXT:    v_or_b32_e32 v37, 0x400000, v6
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v164, v33, v37, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v36, 0x40c00000, v36
@@ -86345,20 +86445,19 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v66, v113, v115, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v113, 0x400000, v13
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v115, 16, v15
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v163, v12, v10, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v20, 16, v18
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v18, 16, v17
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v13, v112, v113 :: v_dual_add_f32 v112, 0x40c00000, v115
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v113, v114, v67, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v114, 0x400000, v67
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v115, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v67, v67
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v116, v112, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v117, 0x400000, v112
-; GFX11-FAKE16-NEXT:    v_or_b32_e32 v118, 0x400000, v15
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v148, v13, v66, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v67, v113, v114, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v114, v115, v16, 0x7fff
@@ -86366,20 +86465,20 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v113, v15, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v116, v116, v112, 0x7fff
+; GFX11-FAKE16-NEXT:    v_or_b32_e32 v118, 0x400000, v15
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v17, 16, v33
-; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[33:34], 24, v[96:97]
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v16, v114, v115, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v112, v112
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v113, v113, v15, 0x7fff
+; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[33:34], 24, v[96:97]
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[34:35], 24, v[86:87]
-; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[35:36], 24, v[84:85]
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v135, v16, v67, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v112, v116, v117, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v15, v15
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v116, 16, v14
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v14, 16, v52
+; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[35:36], 24, v[84:85]
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v132, 16, v5
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v5, 16, v53
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v15, v113, v118, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v118, 16, v12
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v12, 16, v10
@@ -86394,6 +86493,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[49:50], 24, v[148:149]
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[50:51], 24, v[162:163]
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[51:52], 24, v[176:177]
+; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v5, 16, v53
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v117, 16, v25
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v25, 16, v37
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[52:53], 24, v[182:183]
@@ -86830,314 +86930,313 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v50, v27
-; GCN-NEXT:    v_mov_b32_e32 v49, v25
+; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v49, v23
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v27, v22
 ; GCN-NEXT:    v_mov_b32_e32 v39, v21
-; GCN-NEXT:    v_mov_b32_e32 v48, v3
-; GCN-NEXT:    v_mov_b32_e32 v37, v1
+; GCN-NEXT:    v_mov_b32_e32 v22, v20
+; GCN-NEXT:    v_mov_b32_e32 v38, v19
+; GCN-NEXT:    v_mov_b32_e32 v20, v18
+; GCN-NEXT:    v_mov_b32_e32 v37, v17
+; GCN-NEXT:    v_mov_b32_e32 v18, v16
+; GCN-NEXT:    v_mov_b32_e32 v16, v14
+; GCN-NEXT:    v_mov_b32_e32 v14, v12
+; GCN-NEXT:    v_mov_b32_e32 v12, v3
+; GCN-NEXT:    v_mov_b32_e32 v34, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:96
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:72
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:392
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:64
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:56
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:48
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:40
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:32
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v36, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v4
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v6
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v4
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v6
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v8
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v10
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v14
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 8, v18
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v16
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 8, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 8, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 8, v18
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 8, v20
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v22
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 8, v28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v27
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:392
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:112
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v24
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:124
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v26
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v28
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 8, v30
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v35
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:124
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 8, v33
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 8, v6
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v21
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v34
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v23
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v33
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v21
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v19
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v17
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v31
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v36
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v35
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v32
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v31
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:100
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v29
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:104
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:104
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(14) expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v8
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v10
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:140
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:136
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:148
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:136
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v4
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:164
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:172
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:168
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:180
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v4
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:196
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:200
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:208
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:220
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(4) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:224
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:240
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:232
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:252
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:272
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:264
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:272
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:284
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v1
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(3) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v4
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:292
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:308
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:304
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:296
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:304
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v2
@@ -87145,1518 +87244,1521 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v4
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:324
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:332
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:328
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:332
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:340
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:336
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:328
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:336
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:348
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v3
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v4
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:356
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:364
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:352
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:360
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:372
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:388
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:384
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:360
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:388
+; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:384
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:372
 ; GCN-NEXT:    v_lshlrev_b32_e32 v62, 8, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 8, v2
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:120
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v2
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:216
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:248
 ; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:280
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:312
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:344
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:312
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:380
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:376
 ; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:368
-; GCN-NEXT:    s_waitcnt vmcnt(13)
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 8, v3
+; GCN-NEXT:    s_waitcnt vmcnt(12)
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 8, v4
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v4
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr38
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr24
 ; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr4
-; GCN-NEXT:    ; implicit-def: $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr34
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr22
+; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr27
 ; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB46_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    v_mov_b32_e32 v26, v0
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v37
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v48
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v5
+; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v34
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v7
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v9
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v7
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v11
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v13
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v11
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v0, v2
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v17
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v18
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v19
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v19, v0, v2
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v39
-; GCN-NEXT:    v_or_b32_e32 v25, v0, v22
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v0, v24
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v49
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v15
+; GCN-NEXT:    v_mov_b32_e32 v9, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v2, v18
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v37
+; GCN-NEXT:    v_mov_b32_e32 v7, v21
+; GCN-NEXT:    v_or_b32_e32 v21, v2, v20
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v38
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v32, v0, v2
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v50
-; GCN-NEXT:    v_or_b32_e32 v33, v0, v28
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v29
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v19, v2, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v39
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v34, v0, v2
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v12
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v17, v2, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v49
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v0, v2
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v14
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v2, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v25
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v29, v2, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v35, v2, v4
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v32, v2, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v0
+; GCN-NEXT:    v_or_b32_e32 v33, v2, v30
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v37, v2, v4
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v35, v2, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v10
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v38, v2, v4
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v34, v2, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v14
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v36, v2, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v16
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v2, v4
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v37, v2, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v48, v2, v4
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v38, v2, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v4
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v4, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v5, v6
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v49, v6, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v48, v8, v0
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v51
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v50, v6, v5
-; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v51
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v51, v8, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GCN-NEXT:    v_mov_b32_e32 v7, v8
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v8, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v49, v10, v0
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v10, v10, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v10, v0
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v26
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v10, v0
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v28
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v14, v12, v5
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v30
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v28, v10, v0
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v7
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v12, v5
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v7
+; GCN-NEXT:    v_or_b32_e32 v10, v10, v0
+; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v9
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v5
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v5
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v5
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v5
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v5
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v53
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v53, v53, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v5
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v7
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v40, v40, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v40, v40, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v7
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v5
-; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v26
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v9
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v9
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v5, v13
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v9
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xff, v42
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v42, v42, v9
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v9
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v42, v42, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v9
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v13
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v13
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v43, v43, v13
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v15
-; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v43, v43, v15
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v15, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v28, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v17, v17, v27
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v25
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xff, v46
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v46, v46, v27
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v46, v46, v25
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v51, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v51, v51, v52
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v25
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v52
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v52, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v52, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v52, v52, v54
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v54, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v54, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v54, v54, v55
-; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v56
+; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v57
 ; GCN-NEXT:    v_or_b32_e32 v55, v55, v41
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v41, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v41, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v41, v41, v44
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v44, v44, v45
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v45, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v45, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v45, v45, v47
-; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v58
-; GCN-NEXT:    v_or_b32_e32 v47, v47, v57
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v59
+; GCN-NEXT:    v_or_b32_e32 v47, v47, v56
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v56, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v56, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v56, v56, v62
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v57, 0xff, v27
-; GCN-NEXT:    v_or_b32_e32 v31, v57, v31
-; GCN-NEXT:    v_and_b32_e32 v57, 0xff, v60
+; GCN-NEXT:    v_and_b32_e32 v57, 0xff, v25
 ; GCN-NEXT:    v_or_b32_e32 v57, v57, v63
-; GCN-NEXT:    v_and_b32_e32 v58, 0xff, v61
-; GCN-NEXT:    v_or_b32_e32 v1, v58, v1
-; GCN-NEXT:    v_and_b32_e32 v58, 0xff, v59
+; GCN-NEXT:    v_and_b32_e32 v59, 0xff, v60
+; GCN-NEXT:    v_or_b32_e32 v31, v59, v31
+; GCN-NEXT:    v_and_b32_e32 v59, 0xff, v61
+; GCN-NEXT:    v_or_b32_e32 v1, v59, v1
+; GCN-NEXT:    v_and_b32_e32 v58, 0xff, v58
 ; GCN-NEXT:    v_or_b32_e32 v3, v58, v3
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
 ; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v25
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v23
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v32
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v27
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v33
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v29
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v34
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v32
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v36
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v33
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v35
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v35
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v37
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v34
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v38
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v36
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v39
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v37
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v48
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v38
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v2
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v39
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v4
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v4
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v49
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v6
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v50
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v48
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v6
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v51
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v8
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v8
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v10
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v49
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v14
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v50
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v30
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v12
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v28
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v7
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v10
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v16
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v14
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v53
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v9
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v5
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v18
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v16
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v20
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v18
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v40
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v22
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v20
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v11
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v7
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v5
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v22
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v42
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:848 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v24
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v11
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v13
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v9
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v26
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v43
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v26
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v17
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v15
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v46
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v51
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v52
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v54
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v30
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v52
+; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v54
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v55
-; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v41
+; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v41
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v45
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v47
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v56
-; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v31
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v57
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v3
-; GCN-NEXT:    ; implicit-def: $vgpr37
-; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v45
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v47
+; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v56
+; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v57
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v31
+; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v3
+; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; implicit-def: $vgpr9
 ; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr13
 ; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr61
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr18
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr6
+; GCN-NEXT:    ; implicit-def: $vgpr26
 ; GCN-NEXT:    ; implicit-def: $vgpr28
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
-; GCN-NEXT:    ; kill: killed $vgpr0
-; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr21
+; GCN-NEXT:    ; implicit-def: $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr53
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr62
-; GCN-NEXT:    ; implicit-def: $vgpr31
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr18
+; GCN-NEXT:    ; implicit-def: $vgpr20
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr63
+; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:  .LBB46_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB46_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v59
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v58
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v6, v3, v2
+; GCN-NEXT:    v_or_b32_e32 v4, v3, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v61
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v10, v1, v2
+; GCN-NEXT:    v_or_b32_e32 v8, v1, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v60
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v16, v63, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v31, v2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v20, v31, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v0, v63, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v62, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v58
+; GCN-NEXT:    v_or_b32_e32 v0, v62, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v59
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v57, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v56, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v47, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v47, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v45, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v45, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v44, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v56
+; GCN-NEXT:    v_or_b32_e32 v0, v44, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v57
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v41, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v41, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v55, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v55, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v54, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v54, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v52, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v52, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v46
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v43
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v42
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v0
-; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v3, v24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v2
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v0, v24
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v1, v26
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v40
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v0, v26
-; GCN-NEXT:    v_mov_b32_e32 v0, v37
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v0
+; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v4, v28
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v0, v27
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v1
-; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v0
+; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v29, v0, v29
+; GCN-NEXT:    v_mov_b32_e32 v2, v30
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v0
+; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v30
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v53
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v31
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v31
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v32
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v1
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v32
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v23
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v33
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v8
-; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v8, v1, v34
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
-; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v33
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v0, v34
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v21
+; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v34
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
+; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v28
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v35
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v35
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 3, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 3, v6
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v36
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v6, v1, v36
+; GCN-NEXT:    v_mov_b32_e32 v17, v37
+; GCN-NEXT:    v_mov_b32_e32 v19, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v37, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v37
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v37
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mov_b32_e32 v21, v39
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, 3, v51
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v38
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, 3, v1
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v39, vcc, 3, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v38
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v23, v49
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, 3, v51
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xff, v39
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v39
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v1, v48
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v48, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v48
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v48
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v25, v49
-; GCN-NEXT:    v_mov_b32_e32 v27, v50
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v48
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xff, v49
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v49
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v49
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v50, 0xff, v50
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v50
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v50
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v51, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v51, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v51, 0xff, v51
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v51
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v51
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v52, 0xff, v52
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v52
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v52
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v53
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v53
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v54, vcc, 3, v2
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v53
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 3, v16
 ; GCN-NEXT:    v_and_b32_e32 v54, 0xff, v54
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v54
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v55, vcc, 3, v2
+; GCN-NEXT:    v_or_b32_e32 v16, v1, v54
+; GCN-NEXT:    v_add_i32_e32 v55, vcc, 3, v14
 ; GCN-NEXT:    v_and_b32_e32 v55, 0xff, v55
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v55
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v14
+; GCN-NEXT:    v_or_b32_e32 v14, v1, v55
+; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v10
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v10, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v14, v2, v40
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, 3, v12
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v41, 0xff, v41
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v41
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v12, v2, v41
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xff, v42
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v2, v42
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v42
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v27
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v4, v43
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v43
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v25
 ; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v44
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v44
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v44
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v45, vcc, 3, v23
 ; GCN-NEXT:    v_and_b32_e32 v45, 0xff, v45
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v3, v45
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v45
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v46, vcc, 3, v21
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xff, v46
-; GCN-NEXT:    v_or_b32_e32 v46, v22, v46
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v46
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v47, vcc, 3, v19
 ; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v47
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v47, v2, v47
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v47
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v56, vcc, 3, v17
 ; GCN-NEXT:    v_and_b32_e32 v56, 0xff, v56
-; GCN-NEXT:    v_or_b32_e32 v56, v18, v56
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v20, v56
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v57, vcc, 3, v15
 ; GCN-NEXT:    v_and_b32_e32 v57, 0xff, v57
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v57, v2, v57
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v57
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, 3, v13
 ; GCN-NEXT:    v_and_b32_e32 v58, 0xff, v58
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v58, v2, v58
+; GCN-NEXT:    v_or_b32_e32 v58, v1, v58
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, 3, v11
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xff, v59
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v59, v2, v59
+; GCN-NEXT:    v_or_b32_e32 v59, v1, v59
 ; GCN-NEXT:    v_add_i32_e32 v60, vcc, 3, v9
 ; GCN-NEXT:    v_and_b32_e32 v60, 0xff, v60
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v60, v2, v60
+; GCN-NEXT:    v_or_b32_e32 v60, v1, v60
 ; GCN-NEXT:    v_add_i32_e32 v61, vcc, 3, v7
 ; GCN-NEXT:    v_and_b32_e32 v61, 0xff, v61
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v61, v2, v61
+; GCN-NEXT:    v_or_b32_e32 v61, v1, v61
 ; GCN-NEXT:    v_add_i32_e32 v62, vcc, 3, v5
 ; GCN-NEXT:    v_and_b32_e32 v62, 0xff, v62
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v62, v2, v62
-; GCN-NEXT:    v_add_i32_e32 v63, vcc, 3, v1
+; GCN-NEXT:    v_or_b32_e32 v62, v1, v62
+; GCN-NEXT:    v_add_i32_e32 v63, vcc, 3, v12
 ; GCN-NEXT:    v_and_b32_e32 v63, 0xff, v63
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v63, v1, v63
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v45, v0, v3
+; GCN-NEXT:    v_or_b32_e32 v57, v0, v3
 ; GCN-NEXT:    s_movk_i32 s6, 0x300
-; GCN-NEXT:    v_add_i32_e32 v44, vcc, 0x300, v6
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, s6, v10
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, s6, v16
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, s6, v20
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, 0x300, v4
+; GCN-NEXT:    v_add_i32_e32 v47, vcc, s6, v8
+; GCN-NEXT:    v_add_i32_e32 v46, vcc, s6, v22
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, s6, v0
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, s6, v0
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v44, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v40, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v54, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v52, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v51, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v50, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v49, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v48, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v39, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v38, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v37, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v36, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v35, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v34, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, s6, v24
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, s6, v0
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s6, v24
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, s6, v26
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, s6, v27
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, s6, v29
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, s6, v0
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, s6, v26
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, s6, v0
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v25, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, s6, v6
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, s6, v0
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, s6, v8
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, s6, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v16
+; GCN-NEXT:    v_add_i32_e32 v5, vcc, s6, v14
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v9, vcc, s6, v9
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, s6, v5
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, s6, v14
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, s6, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, s6, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, s6, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, s6, v12
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v12
-; GCN-NEXT:    v_add_i32_e32 v46, vcc, s6, v46
-; GCN-NEXT:    v_add_i32_e32 v47, vcc, s6, v47
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, s6, v56
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, s6, v57
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, s6, v10
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, s6, v58
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, s6, v59
 ; GCN-NEXT:    v_add_i32_e32 v60, vcc, s6, v60
 ; GCN-NEXT:    v_add_i32_e32 v61, vcc, s6, v61
 ; GCN-NEXT:    v_add_i32_e32 v62, vcc, s6, v62
 ; GCN-NEXT:    v_add_i32_e32 v63, vcc, s6, v63
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, s6, v45
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v45
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v57, vcc, s6, v57
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v57
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v63
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v63
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v62
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v62
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v61
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v61
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v60
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v60
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v59
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v59
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v58
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v58
+; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v57
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v12
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v56
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v14
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v47
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v16
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v46
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v18
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v14
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v20
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v16
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v22
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v18
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v20
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v22
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v11
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v1
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
@@ -88673,421 +88775,410 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v6
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v7
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v8
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v8
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v13
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v9
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v15
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v10
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v17
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v11
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v19
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v13
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v21
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v23
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v17
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v24
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v19
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v25
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v21
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v26
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v27
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v24
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v28
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v25
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v29
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v26
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v30
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v27
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v31
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v32
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v29
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v33
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v30
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v34
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v31
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v35
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:848 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v32
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v33
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v37
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v34
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v35
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v39
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v36
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v48
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v37
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v49
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v50
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v39
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v48
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v49
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v51
-; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v52
-; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v53
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v55
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v40
-; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v41
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v42
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v44
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v51
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v0, v52
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v53
+; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v54
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v55
+; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v40
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v41
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v43
+; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v44
+; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v45
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v46
+; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v47
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v56
 ; GCN-NEXT:  .LBB46_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v3, v1
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v0, v50, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 4, v50
-; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v6, v3
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v1, v38, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 4, v38
+; GCN-NEXT:    buffer_store_dword v3, v1, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v45, v1, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v45, v3, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v44, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v59, vcc, 8, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v44, v3, v1
+; GCN-NEXT:    v_add_i32_e32 v59, vcc, 8, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v47, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v58, vcc, 12, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v47, v3, v1
+; GCN-NEXT:    v_add_i32_e32 v58, vcc, 12, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v46, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, 16, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v46, v3, v1
+; GCN-NEXT:    v_add_i32_e32 v57, vcc, 16, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, 20, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v3, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, 20, v38
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v3, vcc, 24, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v3, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v3, vcc, 24, v38
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, 28, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v6, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 28, v38
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 32, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v10, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, 32, v38
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, 36, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v10, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 36, v38
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 40, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v63, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 40, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 44, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v60, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 44, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v63, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 48, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v9, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 48, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v60, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 52, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v28, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 52, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v30, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 56, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v31, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 56, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v61, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, 60, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:844 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v61, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 60, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:844 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v7, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v39, vcc, 64, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v18, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, 64, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v62, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x44, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v62, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x44, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v5, v1, v0
-; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x48, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v14, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x48, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v52, v12, v0
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x4c, v50
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v52, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x4c, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v54, v12, v0
-; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x50, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v32
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v1, v11, v0
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x54, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v20
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v54, v10, v1
+; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x50, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v17
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v0
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x58, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v26
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v7, v1
+; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x54, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v15, v12, v0
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x5c, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v6
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v10, v7
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x58, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v50
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v10
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x5c, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v19, v6, v0
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x60, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v27
+; GCN-NEXT:    v_or_b32_e32 v17, v10, v0
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x60, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v24
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v29
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v9, v6, v0
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 0x64, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v33
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v0
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 0x64, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v48
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v13, v2, v0
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x68, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v4
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v0
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 0x68, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v22
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v17, v2, v0
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x6c, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v21
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v10
+; GCN-NEXT:    v_or_b32_e32 v15, v4, v0
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 0x6c, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v21, v2, v0
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 0x70, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v34
+; GCN-NEXT:    v_or_b32_e32 v19, v4, v0
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x70, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v25, v2, v0
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 0x74, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v35
+; GCN-NEXT:    v_or_b32_e32 v23, v4, v0
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 0x74, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v29, v2, v0
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x78, v50
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v50
+; GCN-NEXT:    v_or_b32_e32 v27, v4, v0
+; GCN-NEXT:    v_add_i32_e32 v35, vcc, 0x78, v38
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v38
 ; GCN-NEXT:    buffer_store_dword v45, v59, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v44, v58, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v47, v57, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v46, v56, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v3, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v48, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v8, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_store_dword v4, v3, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v14, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v6, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v18, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v12, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v22, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v16, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, v24, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v63, v28, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v60, v31, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v30, v37, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v20, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v63, v11, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v60, v26, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v9, v30, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v34, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v31, v37, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v61, v39, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v7, v49, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v18, v49, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v62, v51, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v5, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v14, v53, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v52, v55, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v54, v40, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v1, v41, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v11, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v15, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v19, v23, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v9, v27, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v13, v32, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v17, v33, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v21, v34, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v25, v36, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v29, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v7, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v13, v43, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v17, v21, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v5, v25, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v2, v29, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v15, v32, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v19, v33, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v23, v35, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v27, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
@@ -89159,39 +89250,39 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:136
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:152
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:160
 ; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:168
 ; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:176
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v25
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v29
-; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v3
 ; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v5
 ; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v7
-; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
-; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v13
-; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v15
+; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v15
 ; VI-NEXT:    v_lshlrev_b16_e32 v35, 8, v17
 ; VI-NEXT:    v_lshlrev_b16_e32 v36, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v21
 ; VI-NEXT:    v_lshlrev_b16_e32 v34, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v27
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
@@ -89205,46 +89296,47 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v27
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
 ; VI-NEXT:    ; implicit-def: $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v37
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v38
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v39
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v48
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v49
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v50
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v51
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
@@ -89261,7 +89353,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v43
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:208
@@ -89269,15 +89361,15 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
@@ -89300,20 +89392,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -89326,20 +89418,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -89347,17 +89439,17 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
@@ -89373,17 +89465,17 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:316
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
@@ -89398,44 +89490,44 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v3, off, s[0:3], s32 offset:376
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:348
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v1
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v1
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v2
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v3
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:372
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
 ; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -89461,31 +89553,31 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v49 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v4, v4, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr55
 ; VI-NEXT:    ; implicit-def: $vgpr40
 ; VI-NEXT:    ; implicit-def: $vgpr41
-; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr36
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v34 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr34
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v52 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr52
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v37 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr37
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -89504,35 +89596,35 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_or_b32_sdwa v10, v63, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v61, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_or_b32_sdwa v12, v58, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v59, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_or_b32_sdwa v14, v45, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    ; implicit-def: $vgpr45
 ; VI-NEXT:    s_waitcnt vmcnt(2)
@@ -89541,26 +89633,26 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v63, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr62
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v60, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v10, v57, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v56, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr56
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v58, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr56
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -89569,35 +89661,35 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v44, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -89612,20 +89704,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -89633,189 +89725,189 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v31, v31, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    v_or_b32_sdwa v31, v31, v49 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr49
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v32, v32, v53 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v30, v30, v37 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v30, v30, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v30, v30, v38 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v30, v30, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v31, v31, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v31, v31, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    ; kill: killed $vgpr32
@@ -89920,7 +90012,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr51
 ; VI-NEXT:  .LBB46_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB46_4
@@ -89928,53 +90020,51 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v18, 0x300
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
 ; VI-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    v_add_u16_sdwa v4, v0, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(12)
 ; VI-NEXT:    v_or_b32_sdwa v29, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v4, v0, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_sdwa v0, v2, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v29, 0x300, v29
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v2, v0
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v2, 3, v2
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v52, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v51, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v50, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v2, v2, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v2, v3, v2
@@ -89984,19 +90074,19 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v1, 0x300, v1
 ; VI-NEXT:    v_or_b32_e32 v1, v1, v4
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v3, v49, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v48, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v3, v3, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
-; VI-NEXT:    v_or_b32_sdwa v4, v39, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v4, v37, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v4, 0x300, v4
 ; VI-NEXT:    v_or_b32_e32 v3, v4, v3
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
@@ -90004,14 +90094,13 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v33, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v6, 0x300, v6
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v35, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v5, 0x300, v5
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v36, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
@@ -90028,7 +90117,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v6, v32, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v32, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
@@ -90036,7 +90125,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v32
 ; VI-NEXT:    s_waitcnt vmcnt(1)
@@ -90044,78 +90133,94 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v33, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v33, v33, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v27, 0x300, v27
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v33
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v34, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v34, v34, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v34
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v35, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v35, v35, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v25, 0x300, v25
+; VI-NEXT:    v_or_b32_e32 v25, v25, v35
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v36, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v6, v7, v6
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v25, 0x300, v25
-; VI-NEXT:    v_or_b32_e32 v25, v25, v35
+; VI-NEXT:    v_add_u16_sdwa v36, v36, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v36, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_sdwa v36, v36, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v24, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
+; VI-NEXT:    v_or_b32_e32 v24, v24, v36
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v37, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v37, v37, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    v_or_b32_sdwa v24, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v23, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
-; VI-NEXT:    v_or_b32_e32 v24, v24, v36
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
+; VI-NEXT:    v_or_b32_e32 v23, v23, v37
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
 ; VI-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v7, v8, v7
-; VI-NEXT:    v_add_u16_e32 v8, 3, v61
+; VI-NEXT:    v_add_u16_e32 v8, 3, v63
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 3, v62
@@ -90124,30 +90229,30 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v8, v9, v8
-; VI-NEXT:    v_add_u16_e32 v9, 3, v63
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v59
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v9, v10, v9
-; VI-NEXT:    v_add_u16_e32 v10, 3, v60
+; VI-NEXT:    v_add_u16_e32 v10, 3, v57
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v10, v11, v10
-; VI-NEXT:    v_add_u16_e32 v11, 3, v58
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v58
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v11, v12, v11
@@ -90156,7 +90261,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v12, v13, v12
@@ -90165,7 +90270,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v13, v14, v13
@@ -90177,35 +90282,35 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    v_or_b32_e32 v14, v15, v14
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v19, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -90214,54 +90319,43 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v16, v19, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_e32 v16, v19, v16
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
-; VI-NEXT:    v_or_b32_sdwa v30, v38, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v30, v39, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v30, 0x300, v30
-; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
-; VI-NEXT:    v_or_b32_sdwa v31, v50, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v31, v51, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v31, 0x300, v31
-; VI-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
-; VI-NEXT:    v_or_b32_sdwa v21, v37, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v37, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v21, v38, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v40, v21, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v37, v37, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v40
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v23, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
-; VI-NEXT:    v_or_b32_e32 v23, v23, v37
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v38, v38, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v20, 3, v20
-; VI-NEXT:    v_or_b32_sdwa v20, v48, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_sdwa v55, v20, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v38, v38, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_e32 v30, v30, v55
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v39, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v20, 3, v20
+; VI-NEXT:    v_or_b32_sdwa v20, v49, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v55, v20, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v38
-; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_e32 v30, v30, v55
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v39, 3, v39
 ; VI-NEXT:    v_or_b32_sdwa v39, v48, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
@@ -90283,7 +90377,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v53, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v39, 3, v39
 ; VI-NEXT:    v_or_b32_sdwa v39, v49, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -90385,17 +90479,16 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
 ; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:136
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:136
 ; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:144
 ; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:152
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:160
 ; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:168
 ; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:176
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v27
-; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v29
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v3
@@ -90403,81 +90496,81 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v11
-; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v13
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v15
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v17
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v21
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v23
-; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v25
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v25
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    ; implicit-def: $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v37
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v39
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v48
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v49
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
@@ -90485,7 +90578,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
@@ -90502,7 +90595,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v43
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -90511,15 +90604,15 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
@@ -90543,20 +90636,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -90570,20 +90663,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -90592,17 +90685,17 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
@@ -90629,7 +90722,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
@@ -90649,48 +90742,48 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v1
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v3
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:372
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB46_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -90699,9 +90792,9 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s6
@@ -90730,10 +90823,10 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; implicit-def: $vgpr51
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v3, v4, v3, s6
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr48
+; GFX9-NEXT:    ; implicit-def: $vgpr39
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v36 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
@@ -90744,7 +90837,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v5, v6, v5, s6
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr34
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -90759,93 +90852,93 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v7, v8, v7, s6
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v58, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v15, v43, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr43
+; GFX9-NEXT:    v_or_b32_sdwa v15, v42, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v62, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
+; GFX9-NEXT:    ; implicit-def: $vgpr61
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v57, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v43, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr42
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v16, v17, v16, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -90860,20 +90953,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v18, v19, v18, s6
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v19, v20, v19, s6
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v20, v21, v20, s6
@@ -90881,58 +90974,58 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v21, v22, v21, s6
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v22, v23, v22, s6
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v23, v24, v23, s6
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v24, v25, v24, s6
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v25, v26, v25, s6
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v26, v27, v26, s6
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
@@ -90950,22 +91043,22 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v28, v29, v28, s6
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr53
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v37 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v29, v30, v29, s6
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr37
 ; GFX9-NEXT:    ; kill: killed $vgpr37
 ; GFX9-NEXT:    ; implicit-def: $vgpr37
@@ -91062,7 +91155,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v30, v31, v30, s6
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr49
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -91174,27 +91267,27 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_cbranch_execz .LBB46_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
@@ -91202,19 +91295,20 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
+; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v2
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v3
-; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_perm_b32 v0, v2, v0, s6
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(17)
+; GFX9-NEXT:    s_waitcnt vmcnt(16)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
@@ -91239,7 +91333,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v3, v48, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v39, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
@@ -91265,7 +91359,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_or_b32_sdwa v36, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
@@ -91289,11 +91383,11 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v37, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v37, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
-; GFX9-NEXT:    v_or_b32_sdwa v21, v39, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v21, v48, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v33, 0x300, v21
 ; GFX9-NEXT:    v_add_u16_e32 v34, 0x300, v23
 ; GFX9-NEXT:    v_perm_b32 v29, v34, v29, s6
@@ -91301,16 +91395,17 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v32, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v6, 0x300, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_perm_b32 v6, v7, v6, s6
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v38, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
@@ -91319,17 +91414,18 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v39, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v39
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v48, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -91342,45 +91438,45 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v57
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v58
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
@@ -91389,63 +91485,63 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v43
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v43
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v42
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v16
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v18
 ; GFX9-NEXT:    v_perm_b32 v17, v17, v20, s6
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v19
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v16, v18, v16, s6
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v49, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v49, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v20
 ; GFX9-NEXT:    v_perm_b32 v30, v33, v30, s6
@@ -91453,7 +91549,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v50, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v52, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -91463,14 +91559,14 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v51, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v51
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v52, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v53, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -91480,7 +91576,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v53, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v53
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -91493,7 +91589,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v55, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v55
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -91501,7 +91597,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v40, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v50, 0x300, v40
 ; GFX9-NEXT:    v_perm_b32 v21, v50, v21, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -91509,14 +91605,14 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v41, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v41
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v42, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v51, 0x300, v42
 ; GFX9-NEXT:    v_perm_b32 v20, v51, v20, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -91524,7 +91620,7 @@ define <64 x half> @bitcast_v128i8_to_v64f16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v43, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v43
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -93303,840 +93399,820 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:124
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:44
-; GCN-NEXT:    s_waitcnt expcnt(6)
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:40
-; GCN-NEXT:    s_waitcnt expcnt(5)
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:36
-; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:32
-; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:28
-; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:24
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v1
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v4
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v6
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
-; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v10
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v11
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v7
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v13
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v16
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v15
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v10
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v20
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v21
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v24
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v23
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v26
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v11
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v25
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v14
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v28
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v27
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v29
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v15
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v18
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v44
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v20
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v37
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v19
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v36
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v62
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v21
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v53
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v60
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v23
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v26
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v61
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v58
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v59
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v56
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v28
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v57
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v27
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v42
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v41
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:136
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:124
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v47
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v40
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v46
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v45
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v55
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v41
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v54
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v55
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v54
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v53
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v52
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v52
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v51
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v49
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v38
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v50
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v37
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:104
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v38
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v36
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v33
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v35
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:96
 ; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v17
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v46
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v18
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v19
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v8
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v15
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:120
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:128
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v47
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:128
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v16
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v14
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v5
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v15
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v16
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v15
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(1) expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v17
 ; GCN-NEXT:    ; implicit-def: $vgpr61
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr62
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr52
 ; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr28
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr23
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr24
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr20
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr22
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr18
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr21
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; kill: killed $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; kill: killed $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr15
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB47_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v31
-; GCN-NEXT:    v_or_b32_e32 v61, v32, v14
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v35
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v62, v15, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    v_or_b32_e32 v55, v5, v6
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v61, v10, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v32
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v59
-; GCN-NEXT:    v_or_b32_e32 v43, v7, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v63
-; GCN-NEXT:    v_or_b32_e32 v41, v9, v5
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v62, v10, v15
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v47, v10, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v40, v2, v1
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v58, v11, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v54, v4, v1
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v55, v10, v15
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v60, v12, v1
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v59
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v24
-; GCN-NEXT:    v_or_b32_e32 v44, v13, v2
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v43, v10, v15
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v30
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v51, v5, v1
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v41, v10, v15
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v34
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v24
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v45, v6, v3
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v21
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v36
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v47, v10, v15
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v25, v7, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v37
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v46, v9, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v18
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v39
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v40, v10, v15
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v26, v10, v1
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v20
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v48
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v58, v10, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v39
+; GCN-NEXT:    v_or_b32_e32 v54, v48, v15
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v56, v11, v5
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v49
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v49
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v10, v15
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v12, v3
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v50
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v50
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v10, v16
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v57, v13, v6
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v52
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v21
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v30
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v28, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v17
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v51, v10, v15
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v42, v33, v7
-; GCN-NEXT:    v_bfe_u32 v7, v35, 8, 8
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v45, v11, v17
+; GCN-NEXT:    v_mov_b32_e32 v11, v32
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v26, v19, v16
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v46, v19, v18
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
+; GCN-NEXT:    v_or_b32_e32 v27, v2, v15
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v49
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v36
+; GCN-NEXT:    v_or_b32_e32 v56, v1, v6
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v36
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v28, v23, v17
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v39
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v37
+; GCN-NEXT:    v_or_b32_e32 v57, v63, v7
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v35, v7, v4
-; GCN-NEXT:    v_bfe_u32 v4, v59, 8, 8
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v31
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v38
+; GCN-NEXT:    v_or_b32_e32 v29, v12, v16
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v30
+; GCN-NEXT:    v_or_b32_e32 v42, v13, v3
+; GCN-NEXT:    v_bfe_u32 v3, v11, 8, 8
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v59, v4, v9
-; GCN-NEXT:    v_bfe_u32 v4, v23, 8, 8
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v35, v3, v18
+; GCN-NEXT:    v_bfe_u32 v3, v59, 8, 8
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v59, v13, v9
+; GCN-NEXT:    v_bfe_u32 v9, v24, 8, 8
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v4, v1
-; GCN-NEXT:    v_bfe_u32 v1, v32, 8, 8
+; GCN-NEXT:    v_or_b32_e32 v24, v9, v2
+; GCN-NEXT:    v_bfe_u32 v2, v20, 8, 8
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v20, v14, v15
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_bfe_u32 v2, v22, 8, 8
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v22, v53, v1
+; GCN-NEXT:    v_bfe_u32 v1, v48, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v18, v33, v6
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_bfe_u32 v1, v21, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v21, v25, v17
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_bfe_u32 v1, v10, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v16, v4, v19
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v1, v10
-; GCN-NEXT:    v_bfe_u32 v1, v31, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v32, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v32, v11
+; GCN-NEXT:    v_or_b32_e32 v19, v34, v7
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v1, v5
-; GCN-NEXT:    v_bfe_u32 v1, v22, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v52, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v52, v3
+; GCN-NEXT:    v_or_b32_e32 v15, v8, v23
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v1, v11
-; GCN-NEXT:    v_bfe_u32 v1, v16, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v50, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v17, v5, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v1, v3
-; GCN-NEXT:    v_bfe_u32 v1, v15, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v49, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v15, v8, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v1, v21, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v36, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v3, v1, v6
-; GCN-NEXT:    v_bfe_u32 v1, v20, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v39, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v16, v38, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v1, v18, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v31, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v10, v62, v61, 24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v29, v2
-; GCN-NEXT:    v_bfe_u32 v2, v14, 8, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v62, v61, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v2, v24, 8, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v62, v61, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v2, v30, 8, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v43, v55, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v2, v19, 8, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v43, v55, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v11, v43, v55, 8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v62, v61, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v47, v41, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v62, v61, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v47, v41, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v62, v61, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v47, v41, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v43, v55, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v58, v40, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v43, v55, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v33, v43, v55, 8
+; GCN-NEXT:    v_alignbit_b32 v1, v58, v40, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v47, v41, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v58, v40, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v47, v41, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v60, v54, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v47, v41, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v60, v54, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v58, v40, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v60, v54, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v58, v40, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v44, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v58, v40, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v44, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v60, v54, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v51, v44, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v60, v54, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v26, v45, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v60, v54, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v26, v45, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v51, v44, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v26, v45, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v51, v44, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v27, v46, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v51, v44, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v27, v46, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v25, v45, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v27, v46, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v25, v45, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v28, v56, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v25, v45, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v28, v56, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v26, v46, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v28, v56, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v26, v46, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v29, v57, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v26, v46, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v29, v57, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v27, v56, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v29, v57, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v27, v56, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v27, v56, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v28, v57, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v28, v57, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v24, v59, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v28, v57, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v24, v59, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v35, v42, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v24, v59, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v35, v42, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v22, v20, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v35, v42, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v22, v20, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v23, v59, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v22, v20, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v23, v59, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v21, v18, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v23, v59, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v21, v18, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v9, v7, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v21, v18, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v9, v7, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v2, v9, v7, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v19, v16, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v5, v4, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v19, v16, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v5, v4, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v2, v5, v4, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v19, v16, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v3, v15, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v17, v15, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v3, v15, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v17, v15, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v3, v15, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v17, v15, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v1, v16, 24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v62
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v1, v16, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v43
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v2, v1, v16, 8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v47
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v62
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v58
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v43
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v60
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v47
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v51
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v58
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v26
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v60
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v27
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v51
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v28
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v25
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v26
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v35
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v27
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v28
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v22
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v35
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v21
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v23
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v9
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v5
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v53, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v19
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 8, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v14, v1
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v1, v17, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
-; GCN-NEXT:    ; implicit-def: $vgpr31
-; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    v_bfe_u32 v1, v30, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr2
 ; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr5
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr9
-; GCN-NEXT:    ; implicit-def: $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr11
-; GCN-NEXT:    ; implicit-def: $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr4
 ; GCN-NEXT:    ; implicit-def: $vgpr12
-; GCN-NEXT:    ; implicit-def: $vgpr22
+; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr34
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr9
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; kill: killed $vgpr4
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr31
+; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr25
 ; GCN-NEXT:    ; implicit-def: $vgpr37
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; kill: killed $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; kill: killed $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr38
-; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:  .LBB47_2: ; %Flow
-; GCN-NEXT:    s_or_saveexec_b64 s[4:5], s[4:5]
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_xor_b64 exec, exec, s[4:5]
+; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB47_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v52
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v38
-; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
-; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v15, v14
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v17
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v29
-; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
-; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v15
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v14
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v14, v15, v16
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v50
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
@@ -94144,612 +94220,632 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v15
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v19
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v14
-; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v8
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v17
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v49
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v14
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v17, v8
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v8
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v37
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v34
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v4
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v31
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v33
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v14
-; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v19
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v25
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v4
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v21, v4, v5
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v36
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v14
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v48
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v14
-; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v8, v19, v8
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v53
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v4
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v22, v4, v5
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v9
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v14
-; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v21
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v18
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v21
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v39
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v13
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
+; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
+; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v63
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v37
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v30
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v50
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v49
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v39
+; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v48
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
-; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v42, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v43, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
-; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v63
-; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v47, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v42, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v43, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v14
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v14
+; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v47, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v15
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
+; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
+; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
+; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
+; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
+; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
+; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
+; GCN-NEXT:    v_add_f32_e32 v24, 0x38000000, v24
+; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
+; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
+; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
 ; GCN-NEXT:    v_add_f32_e32 v25, 0x38000000, v25
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
 ; GCN-NEXT:    v_add_f32_e32 v26, 0x38000000, v26
 ; GCN-NEXT:    v_add_f32_e32 v27, 0x38000000, v27
-; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
-; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_add_f32_e32 v57, 0x38000000, v28
+; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_add_f32_e32 v58, 0x38000000, v29
+; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
+; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
+; GCN-NEXT:    v_add_f32_e32 v33, 0x38000000, v33
+; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
 ; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
 ; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
-; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
+; GCN-NEXT:    v_add_f32_e32 v59, 0x38000000, v37
 ; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
-; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
 ; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
 ; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
-; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
 ; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
+; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
 ; GCN-NEXT:    v_add_f32_e32 v54, 0x38000000, v54
-; GCN-NEXT:    v_add_f32_e32 v57, 0x38000000, v24
 ; GCN-NEXT:    v_add_f32_e32 v55, 0x38000000, v55
 ; GCN-NEXT:    v_add_f32_e32 v40, 0x38000000, v40
 ; GCN-NEXT:    v_add_f32_e32 v41, 0x38000000, v41
-; GCN-NEXT:    v_add_f32_e32 v58, 0x38000000, v22
-; GCN-NEXT:    v_add_f32_e32 v59, 0x38000000, v13
 ; GCN-NEXT:    v_add_f32_e32 v42, 0x38000000, v42
 ; GCN-NEXT:    v_add_f32_e32 v43, 0x38000000, v43
-; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
-; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
 ; GCN-NEXT:    v_add_f32_e32 v44, 0x38000000, v44
-; GCN-NEXT:    v_add_f32_e32 v60, 0x38000000, v12
-; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
-; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
 ; GCN-NEXT:    v_add_f32_e32 v45, 0x38000000, v45
-; GCN-NEXT:    v_add_f32_e32 v61, 0x38000000, v11
 ; GCN-NEXT:    v_add_f32_e32 v46, 0x38000000, v46
-; GCN-NEXT:    v_add_f32_e32 v62, 0x38000000, v9
 ; GCN-NEXT:    v_add_f32_e32 v47, 0x38000000, v47
-; GCN-NEXT:    v_add_f32_e32 v63, 0x38000000, v10
-; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v6
-; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v5
+; GCN-NEXT:    v_add_f32_e32 v32, 0x38000000, v32
 ; GCN-NEXT:    v_add_f32_e32 v56, 0x38000000, v56
-; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v7
-; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v31
-; GCN-NEXT:    v_add_f32_e32 v33, 0x38000000, v32
-; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v17
-; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v26
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v27
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v28
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v29
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v37
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v34
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v48
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v49
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v30
-; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v51
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v52
-; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v54
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v57
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v55
-; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v40
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v41
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v58
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v59
-; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v42
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v43
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v3
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v60
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v1
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v61
-; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v62
-; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v63
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v53
-; GCN-NEXT:    v_mov_b32_e32 v53, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v56
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v15
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v33
-; GCN-NEXT:    v_mov_b32_e32 v33, v16
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v17
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v59, v23, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v23, v19, v5
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v21
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v10
-; GCN-NEXT:    v_or_b32_e32 v42, v9, v6
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v38
-; GCN-NEXT:    v_or_b32_e32 v35, v35, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v39
-; GCN-NEXT:    v_or_b32_e32 v57, v28, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v51
-; GCN-NEXT:    v_or_b32_e32 v28, v27, v6
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v55
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v36
-; GCN-NEXT:    v_or_b32_e32 v56, v13, v8
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v41
-; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v29
-; GCN-NEXT:    v_or_b32_e32 v27, v22, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v58
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v46, v24, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v60
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    v_or_b32_e32 v45, v34, v10
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v18
-; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v6
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v62
-; GCN-NEXT:    v_or_b32_e32 v44, v12, v14
-; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    v_bfe_u32 v5, v62, 8, 8
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v51, v40, v38
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v9
+; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v11
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v13
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v14
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v24
+; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v25
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v26
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v57
+; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v58
+; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v33
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v34
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v35
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v36
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v59
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v39
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v49
+; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v53
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v54
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v55
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v40
+; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v41
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v42
+; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v45
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v46
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v47
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
+; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v56
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v60
+; GCN-NEXT:    v_or_b32_e32 v59, v61, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v18
+; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v63
+; GCN-NEXT:    v_or_b32_e32 v24, v62, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    v_or_b32_e32 v42, v3, v35
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_or_b32_e32 v35, v4, v38
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v57, v29, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v19
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v58
+; GCN-NEXT:    v_or_b32_e32 v29, v28, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v37
+; GCN-NEXT:    v_or_b32_e32 v56, v13, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v30
+; GCN-NEXT:    v_or_b32_e32 v28, v14, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v39
+; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v46, v23, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v15
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
+; GCN-NEXT:    v_or_b32_e32 v45, v26, v51
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v54
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
+; GCN-NEXT:    v_or_b32_e32 v26, v25, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v32
+; GCN-NEXT:    v_or_b32_e32 v44, v12, v52
+; GCN-NEXT:    v_bfe_u32 v1, v32, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v51, v11, v38
+; GCN-NEXT:    buffer_store_dword v54, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_bfe_u32 v52, v54, 8, 8
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v54, v10, v13
+; GCN-NEXT:    buffer_store_dword v40, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_bfe_u32 v1, v40, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v60, v9, v30
+; GCN-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v5, v18, 8, 8
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v54, v54, v13
-; GCN-NEXT:    buffer_store_dword v60, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_bfe_u32 v1, v48, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v40, v8, v14
+; GCN-NEXT:    buffer_store_dword v39, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v5, v60, 8, 8
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_bfe_u32 v1, v39, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v58, v7, v31
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_or_b32_e32 v60, v52, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v1, v1, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v16, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v41, v49, v23
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_bfe_u32 v1, v2, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v47, v50, v34
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_bfe_u32 v1, v19, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v40, v50, v22
-; GCN-NEXT:    buffer_store_dword v58, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v58, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v55, v55, v33
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_or_b32_e32 v58, v43, v30
-; GCN-NEXT:    buffer_store_dword v41, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_bfe_u32 v1, v6, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v43, v43, v37
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v41, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v17, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_or_b32_e32 v41, v61, v24
-; GCN-NEXT:    buffer_store_dword v55, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v61, v53, v36
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v55, 8, 8
+; GCN-NEXT:    v_bfe_u32 v1, v20, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v47, v47, v32
-; GCN-NEXT:    buffer_store_dword v49, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v49, 8, 8
+; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v62, v1, v25
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_bfe_u32 v1, v18, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v55, v4, v31
-; GCN-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v48, 8, 8
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_bfe_u32 v1, v1, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v43, v3, v36
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v11, 8, 8
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_bfe_u32 v1, v1, 8, 8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v61, v2, v34
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v21, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v62, v15, v37
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v20, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v53, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v53, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_bfe_u32 v1, v33, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14) expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v1, v19, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v62, v61, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_bfe_u32 v1, v1, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v10, v62, v61, 24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v62, v61, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v62, v61, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v43, v55, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v43, v55, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v33, v43, v55, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v11, v43, v55, 8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v47, v41, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v47, v41, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v47, v41, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v58, v40, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v58, v40, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v58, v40, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v60, v54, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v60, v54, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v60, v54, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v51, v44, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v51, v44, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_alignbit_b32 v1, v51, v44, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v26, v45, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v26, v45, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v25, v45, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v26, v45, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v27, v46, 24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v27, v46, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v27, v46, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v28, v56, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v25, v45, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v28, v56, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v25, v45, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v28, v56, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v26, v46, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v29, v57, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v26, v46, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v29, v57, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v26, v46, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v29, v57, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v27, v56, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v27, v56, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v27, v56, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v28, v57, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v24, v59, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v28, v57, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v24, v59, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v28, v57, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v24, v59, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v22, v20, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v22, v20, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v35, v42, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v22, v20, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v23, v59, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v21, v18, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v23, v59, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v21, v18, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v23, v59, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v3, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v21, v18, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v19, v16, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v3, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v19, v16, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v3, 8
-; GCN-NEXT:    v_mov_b32_e32 v5, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v3, 24
+; GCN-NEXT:    v_alignbit_b32 v1, v19, v16, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v17, v15, 24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v3, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v17, v15, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v3, 8
-; GCN-NEXT:    v_mov_b32_e32 v4, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v3, v15, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v3, v15, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v3, v15, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v16, 24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v16, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v2, v16, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v1, v17, v15, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v62
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v43
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v47
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v58
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v60
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v51
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v26
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v35
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v24
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v22
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v53, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v21
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v14, v2
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v19
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 8, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_bfe_u32 v1, v17, 8, 8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_bfe_u32 v1, v1, 8, 8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:  .LBB47_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v61
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v62
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 24, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
@@ -94764,287 +94860,273 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v2, v1, s[0:3], 0 offen
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v55
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v33
-; GCN-NEXT:    v_or_b32_e32 v29, v1, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v11
+; GCN-NEXT:    v_or_b32_e32 v30, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v43
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
-; GCN-NEXT:    v_or_b32_e32 v30, v1, v2
+; GCN-NEXT:    v_or_b32_e32 v31, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v41
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; GCN-NEXT:    v_or_b32_e32 v2, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v47
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v61, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v40
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v62, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v58
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v54
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v44
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v7, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v51
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v8, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v45
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v9, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v25
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v26
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v10, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v46
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v11, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v26
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v27
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v12, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v56
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v13, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v27
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v28
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v22, v1, v3
+; GCN-NEXT:    v_or_b32_e32 v14, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v57
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v24, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v28
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v23, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v29
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v25, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v42
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v26, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v35
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v27, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v59
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v28, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v24
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v23, v1, v3
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v24, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v20
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GCN-NEXT:    v_or_b32_e32 v20, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v22
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_mov_b32_e32 v5, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v1, v3
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v18
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GCN-NEXT:    v_or_b32_e32 v18, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v21
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v21, v1, v3
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v16
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_mov_b32_e32 v4, v17
-; GCN-NEXT:    v_or_b32_e32 v17, v1, v3
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v16, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v19
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v20, v1, v3
+; GCN-NEXT:    v_or_b32_e32 v19, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v15
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; GCN-NEXT:    v_or_b32_e32 v15, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v53
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v18, v1, v3
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v14
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v17
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
-; GCN-NEXT:    v_or_b32_e32 v16, v1, v3
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v17, v1, v3
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v31, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v29, v3, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v52
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v33, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v32, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v34, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v35, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v36, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v37, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v38, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v39, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v48, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v49, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v50, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v51, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v52, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
@@ -95052,15 +95134,15 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v54, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
@@ -95068,15 +95150,15 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v40, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
@@ -95084,90 +95166,94 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v42, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v43, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v44, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v45, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v46, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v47, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v56, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v57, v3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v5
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v58, v3, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v59, v3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v4
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v60, v3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v29
-; GCN-NEXT:    v_or_b32_e32 v4, v1, v31
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v30
+; GCN-NEXT:    v_or_b32_e32 v4, v1, v29
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v31
 ; GCN-NEXT:    v_or_b32_e32 v5, v1, v33
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 8, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
@@ -95181,25 +95267,26 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_i32_e32 v62, vcc, 20, v0
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v36
 ; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 24, v0
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v29, v37
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 28, v0
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
+; GCN-NEXT:    v_or_b32_e32 v29, v29, v37
+; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 28, v0
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v30, v38
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v38
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 32, v0
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v39
@@ -95224,14 +95311,14 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v13
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v53
 ; GCN-NEXT:    v_add_i32_e32 v37, vcc, 60, v0
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v22
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v14
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v54
 ; GCN-NEXT:    v_add_i32_e32 v38, vcc, 64, v0
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v24
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v55
+; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v23
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v55
 ; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x44, v0
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v25
-; GCN-NEXT:    v_or_b32_e32 v24, v24, v40
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v25
+; GCN-NEXT:    v_or_b32_e32 v23, v23, v40
 ; GCN-NEXT:    v_add_i32_e32 v48, vcc, 0x48, v0
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v26
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v41
@@ -95242,34 +95329,32 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v28
 ; GCN-NEXT:    v_or_b32_e32 v27, v27, v43
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 0x54, v0
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v23, v44
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v44
 ; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x58, v0
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v45
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v45
 ; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x5c, v0
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v46
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v46
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x60, v0
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    v_or_b32_e32 v17, v17, v47
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v47
 ; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x64, v0
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v56
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v56
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x68, v0
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    v_or_b32_e32 v15, v15, v57
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v57
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x6c, v0
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v58
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v58
 ; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x70, v0
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v59
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v59
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x74, v0
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v60
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
+; GCN-NEXT:    v_or_b32_e32 v17, v17, v60
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
 ; GCN-NEXT:    buffer_store_dword v4, v1, s[0:3], 0 offen
@@ -95302,20 +95387,20 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v11, v37, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v12, v38, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v13, v39, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v22, v48, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v24, v49, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v14, v48, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v23, v49, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v25, v50, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v26, v28, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v27, v51, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v23, v52, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v19, v53, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v21, v54, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v17, v55, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v20, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v15, v41, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v18, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v14, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v16, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v24, v52, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v20, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v54, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v18, v55, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v21, v40, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v16, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v19, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v15, v43, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v17, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Reload
@@ -95904,8 +95989,8 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; VI-NEXT:    v_lshrrev_b64 v[45:46], 24, v[45:46]
 ; VI-NEXT:    v_lshrrev_b64 v[52:53], 24, v[52:53]
 ; VI-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:84 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshrrev_b32_e32 v31, 8, v49
@@ -96423,8 +96508,8 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:20 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:16 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    ; kill: killed $vgpr50
@@ -96549,7 +96634,7 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v33
 ; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:    ; kill: killed $vgpr33
@@ -96668,7 +96753,6 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 24, v2
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(45)
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 16, v32
 ; GFX9-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v33, 8, v32
@@ -96807,7 +96891,6 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v34, off, s[0:3], s32 offset:128 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshrrev_b64 v[33:34], 24, v[13:14]
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
 ; GFX9-NEXT:    v_pk_add_f16 v32, v32, s6 op_sel_hi:[1,0]
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
 ; GFX9-NEXT:    v_pk_add_f16 v31, v31, s6 op_sel_hi:[1,0]
@@ -97224,13 +97307,14 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:80
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v22, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:84
@@ -97250,13 +97334,14 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:88
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v24, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:92
@@ -97276,13 +97361,14 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:96
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v26, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:100
@@ -97302,13 +97388,14 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:104
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v28, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:108
@@ -97328,13 +97415,14 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:112
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:172 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:164 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:156 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:164 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v1
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v30, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v2, 8, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:116
@@ -98668,436 +98756,428 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:988 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:992 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v49, v7
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v55, v1
-; GCN-NEXT:    v_mov_b32_e32 v60, v0
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:984 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v35, v0
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:56
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:52
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:48
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:44
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:40
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:24
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:16
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 8, v6
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 24, v8
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:12
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 8, v6
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v48, 24, v8
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 24, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v14
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 24, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 24, v12
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v48, 24, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 24, v12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v22
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v40, 24, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 24, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 24, v20
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v30
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 24, v28
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 24, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v10
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v18
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v26
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:392
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:120
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:112
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v23
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v21
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v33
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v32
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v13
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v25
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v17
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v19
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v29
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v7
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v5
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v9
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v7
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:80
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:72
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:164
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:156
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:96
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v7
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v8
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v10
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:164
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:156
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:152
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:148
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:144
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:196
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:188
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:196
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:188
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:184
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:180
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:176
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:220
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:228
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:220
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:212
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:208
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:252
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:260
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:252
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:248
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:244
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:240
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:548 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:848 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:284
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:292
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:284
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:280
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:276
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:272
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:316
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:324
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:316
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:312
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:308
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:304
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:348
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:344
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:356
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:348
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:344
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:340
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:336
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:388
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:380
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:376
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:388
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:380
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:376
 ; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:372
 ; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:368
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 8, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 24, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v5
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 24, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:692 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v31
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 8, v21
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:140
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:136
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:160
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:172
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:168
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:192
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:200
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:232
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 8, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:288
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 8, v2
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:72
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:40
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 8, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:300
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:296
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:320
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:332
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:328
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:328
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:352
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:364
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:360
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:360
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:384
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 8, v1
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v38, 8, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 8, v4
-; GCN-NEXT:    ; implicit-def: $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr9
+; GCN-NEXT:    ; implicit-def: $vgpr6
 ; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr31
-; GCN-NEXT:    ; implicit-def: $vgpr4
-; GCN-NEXT:    ; kill: killed $vgpr4
-; GCN-NEXT:    ; implicit-def: $vgpr18
-; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr8
 ; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr4
-; GCN-NEXT:    ; kill: killed $vgpr4
-; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    ; implicit-def: $vgpr21
+; GCN-NEXT:    ; implicit-def: $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr4
 ; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr18
+; GCN-NEXT:    ; implicit-def: $vgpr22
 ; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; kill: killed $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr6
 ; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; kill: killed $vgpr7
-; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr14
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; kill: killed $vgpr7
-; GCN-NEXT:    ; implicit-def: $vgpr7
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr51
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr10
@@ -99105,720 +99185,790 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; kill: killed $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr34
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; kill: killed $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; kill: killed $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr7
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; kill: killed $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr13
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; kill: killed $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; kill: killed $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; kill: killed $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; kill: killed $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; kill: killed $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; kill: killed $vgpr25
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; kill: killed $vgpr56
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB48_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v2, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v29, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v21, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v6, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v31, v1, v2
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v32, v1, v24
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v28, v1, v28
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v23, v1, v2
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v19
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v30
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v55
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v29
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v45
+; GCN-NEXT:    v_or_b32_e32 v20, v17, v20
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v30
+; GCN-NEXT:    v_or_b32_e32 v19, v17, v19
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v17, v17, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v1
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v17
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v24, v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v23, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v24, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v25, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v29, v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v25, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v31, v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v25, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v32, v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v25, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v33, v18, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v36, v18, v59
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v25, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v25, v1
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v25, v45
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v1
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v52
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v13, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:648 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v1
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v13, v51
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v44
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v26
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v12, v38
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v26
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v37
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v12, v39
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:684 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    v_or_b32_e32 v56, v48, v12
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v3
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
+; GCN-NEXT:    v_or_b32_e32 v1, v18, v38
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v28
+; GCN-NEXT:    v_or_b32_e32 v39, v18, v39
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v50, v12
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:644 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v15
-; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v2, v36, v13
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:992 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
+; GCN-NEXT:    v_or_b32_e32 v56, v47, v18
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v18
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v53, v13
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:988 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:984 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v3, v40, v22
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v47, v22
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v42
+; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v0
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v26
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v1
+; GCN-NEXT:    v_or_b32_e32 v59, v48, v26
+; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v1, v63, v26
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v0, v49, v26
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v63, v40, v28
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
+; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v0, v42, v28
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v54
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v62, v1, v33
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v1, v30
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v33
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
+; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v17, v1, v38
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v0, v57, v30
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v38
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v54
-; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v55, v1, v34
+; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v1, v34
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v61, v1, v48
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v61, v1, v37
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v48, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
+; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v48
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v0, v1, v37
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v59, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v62, v1, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v53
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v1, v38
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v63, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v52, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v52
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v58, v1, v52
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v53
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v52, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v52
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v1, v52
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v44, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v45, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v1, v40
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v58, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v52, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v57, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v46, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v1, v40
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v50, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v41, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v45, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v46, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v48, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v53, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v43, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v44, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v41, v1, v53
+; GCN-NEXT:    v_or_b32_e32 v51, v1, v40
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v42, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v43, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v50, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v47, v1, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v42, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v53, v1, v53
+; GCN-NEXT:    v_or_b32_e32 v48, v1, v40
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v54, 0xff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v54
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v54, v1, v54
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff, v10
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff, v27
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v55, 0xffff, v4
-; GCN-NEXT:    v_and_b32_e32 v40, 0xffff, v5
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v6
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v7
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff, v8
-; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v31
-; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v14
-; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v32
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v16
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v28
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v47, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v1
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v18
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v19
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v20
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v21
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v23
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v24
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v30
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v52, v1, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v40, v1, v40
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v10
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v12
+; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
+; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
+; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v22
+; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v13
+; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v14
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v23
+; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v16
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v20
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v19
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v24
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v25
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v27
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v29
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v31
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v32
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v33
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v36
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v35, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v0
+; GCN-NEXT:    v_and_b32_e32 v32, 0xffff, v39
+; GCN-NEXT:    v_mov_b32_e32 v0, v56
+; GCN-NEXT:    v_or_b32_e32 v37, v1, v0
+; GCN-NEXT:    v_mov_b32_e32 v1, v59
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v1
+; GCN-NEXT:    v_or_b32_e32 v27, v3, v63
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v60
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v55
+; GCN-NEXT:    v_or_b32_e32 v25, v6, v61
+; GCN-NEXT:    v_or_b32_e32 v39, v7, v62
+; GCN-NEXT:    v_or_b32_e32 v59, v8, v58
+; GCN-NEXT:    v_or_b32_e32 v56, v9, v45
+; GCN-NEXT:    v_or_b32_e32 v38, v10, v57
+; GCN-NEXT:    v_or_b32_e32 v34, v11, v41
+; GCN-NEXT:    v_or_b32_e32 v49, v12, v53
+; GCN-NEXT:    v_or_b32_e32 v10, v13, v51
+; GCN-NEXT:    v_or_b32_e32 v54, v14, v50
+; GCN-NEXT:    v_or_b32_e32 v11, v15, v48
+; GCN-NEXT:    v_or_b32_e32 v3, v16, v52
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v13, v17, v15
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v9, v18, v16
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v32, 0xffff, v1
-; GCN-NEXT:    v_mov_b32_e32 v1, v56
-; GCN-NEXT:    v_or_b32_e32 v8, v49, v1
-; GCN-NEXT:    v_or_b32_e32 v31, v51, v2
-; GCN-NEXT:    v_or_b32_e32 v56, v57, v3
-; GCN-NEXT:    v_or_b32_e32 v4, v55, v0
-; GCN-NEXT:    v_or_b32_e32 v5, v40, v62
-; GCN-NEXT:    v_or_b32_e32 v6, v37, v17
-; GCN-NEXT:    v_or_b32_e32 v7, v25, v61
-; GCN-NEXT:    v_or_b32_e32 v37, v34, v59
-; GCN-NEXT:    v_or_b32_e32 v25, v9, v63
-; GCN-NEXT:    v_or_b32_e32 v38, v10, v44
-; GCN-NEXT:    v_or_b32_e32 v51, v11, v52
-; GCN-NEXT:    v_or_b32_e32 v55, v12, v50
-; GCN-NEXT:    v_or_b32_e32 v49, v13, v48
-; GCN-NEXT:    v_or_b32_e32 v40, v14, v41
-; GCN-NEXT:    v_or_b32_e32 v11, v15, v39
-; GCN-NEXT:    v_or_b32_e32 v57, v16, v53
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v36, v12
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v16, v18, v14
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v10, v19, v13
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v20, v34
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:984 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v12, v19, v14
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v8, v20, v18
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v18, v21, v20
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v20, v21, v17
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v22, v21
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v23, v22
+; GCN-NEXT:    v_or_b32_e32 v7, v23, v22
 ; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v24, v23
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:980 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v33, v33, v24
+; GCN-NEXT:    v_or_b32_e32 v6, v24, v23
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:980 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v6, v36, v24
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v6, v26, v36
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v26, v58
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v6, v33, v46
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v27, v46
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v6, v28, v44
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v28, v45
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v6, v29, v43
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v29, v43
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v6, v30, v42
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v30, v42
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v6, v31, v47
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v35, v47
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v6, v32, v40
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v0
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v1
+; GCN-NEXT:    v_mov_b32_e32 v1, v37
+; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v63
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v15, v32, v54
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v60
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:976 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v1
-; GCN-NEXT:    v_mov_b32_e32 v1, v8
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:976 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v55
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v2
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v61
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v0
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v62
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v62
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v58
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v17
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v45
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v61
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v57
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v59
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v41
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v63
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v53
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v44
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v52
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v50
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v48
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v41
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v39
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v61, v1, v12, 16
-; GCN-NEXT:    v_mov_b32_e32 v12, v33
-; GCN-NEXT:    v_alignbit_b32 v33, v31, v14, 16
-; GCN-NEXT:    v_mov_b32_e32 v27, v56
-; GCN-NEXT:    v_alignbit_b32 v59, v56, v13, 16
-; GCN-NEXT:    v_mov_b32_e32 v13, v19
-; GCN-NEXT:    v_alignbit_b32 v29, v4, v34, 16
-; GCN-NEXT:    v_alignbit_b32 v0, v5, v20, 16
-; GCN-NEXT:    v_alignbit_b32 v14, v6, v21, 16
-; GCN-NEXT:    v_mov_b32_e32 v21, v18
-; GCN-NEXT:    v_mov_b32_e32 v18, v10
-; GCN-NEXT:    v_mov_b32_e32 v56, v16
-; GCN-NEXT:    v_mov_b32_e32 v16, v36
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v51
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v0, 16, v50
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v0, 16, v48
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v1, v15, 16
+; GCN-NEXT:    v_alignbit_b32 v63, v2, v16, 16
+; GCN-NEXT:    v_mov_b32_e32 v55, v27
+; GCN-NEXT:    v_alignbit_b32 v27, v27, v14, 16
+; GCN-NEXT:    v_mov_b32_e32 v14, v19
+; GCN-NEXT:    v_alignbit_b32 v41, v4, v18, 16
+; GCN-NEXT:    v_mov_b32_e32 v18, v20
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v5, v17, 16
+; GCN-NEXT:    v_mov_b32_e32 v17, v8
+; GCN-NEXT:    v_mov_b32_e32 v8, v12
+; GCN-NEXT:    v_mov_b32_e32 v12, v9
+; GCN-NEXT:    v_mov_b32_e32 v9, v13
+; GCN-NEXT:    v_mov_b32_e32 v13, v10
 ; GCN-NEXT:    v_mov_b32_e32 v10, v38
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v7, v22, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v62, v37
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v37, v23, 16
-; GCN-NEXT:    v_mov_b32_e32 v23, v9
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v37, v25
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v25, v24, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v10, v58, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v34, v51
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v51, v46, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v58, v55
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v55, v45, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v52, v49
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v49, v43, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v46, v40
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v15, v40, v42, 16
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v41, v11
-; GCN-NEXT:    v_alignbit_b32 v11, v11, v47, 16
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v57, off, s[0:3], s32 offset:996 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v3, v57, v54, 16
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v53
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
-; GCN-NEXT:    ; implicit-def: $vgpr55
+; GCN-NEXT:    v_mov_b32_e32 v60, v39
+; GCN-NEXT:    v_mov_b32_e32 v61, v25
+; GCN-NEXT:    v_alignbit_b32 v25, v25, v21, 16
+; GCN-NEXT:    v_mov_b32_e32 v21, v26
+; GCN-NEXT:    v_alignbit_b32 v16, v60, v22, 16
+; GCN-NEXT:    v_mov_b32_e32 v50, v59
+; GCN-NEXT:    v_alignbit_b32 v33, v59, v23, 16
+; GCN-NEXT:    v_mov_b32_e32 v23, v7
+; GCN-NEXT:    v_mov_b32_e32 v22, v0
+; GCN-NEXT:    v_mov_b32_e32 v51, v56
+; GCN-NEXT:    v_alignbit_b32 v0, v56, v24, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v10, v36, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v62, v34
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v34, v46, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v7, v49
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v49, v44, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v13, v43, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v53, v54
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v54, v42, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v43, v11
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v11, v47, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v46, v3
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v0, v3, v40, 16
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v0, 16, v52
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
 ; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
 ; GCN-NEXT:    ; implicit-def: $vgpr11
-; GCN-NEXT:    ; kill: killed $vgpr11
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr11
-; GCN-NEXT:    ; kill: killed $vgpr11
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; kill: killed $vgpr15
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
 ; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr28
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; kill: killed $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr0
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr11
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr54
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr19
@@ -99871,635 +100021,547 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr29
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr20
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; kill: killed $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr28
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; kill: killed $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; kill: killed $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr52
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:  .LBB48_2: ; %Flow
 ; GCN-NEXT:    s_or_saveexec_b64 s[4:5], s[4:5]
-; GCN-NEXT:    v_mov_b32_e32 v25, v27
+; GCN-NEXT:    v_mov_b32_e32 v56, v33
 ; GCN-NEXT:    s_xor_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB48_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v26
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v28
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v39, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v19
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v30
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v30, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v19, v2
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v38, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v1
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v45
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v20, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
-; GCN-NEXT:    v_or_b32_e32 v1, v51, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v37, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
-; GCN-NEXT:    v_or_b32_e32 v1, v28, v6
+; GCN-NEXT:    v_or_b32_e32 v1, v29, v6
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:860 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v6, v22, v7
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v6, v26, v7
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v7, v1, v8
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
-; GCN-NEXT:    v_or_b32_e32 v8, v45, v9
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v8, v52, v9
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
-; GCN-NEXT:    v_or_b32_e32 v9, v24, v10
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:848 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v9, v1, v10
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:872 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v12
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v13
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:868 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v1, v13
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v14
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v1, v14
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v14, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v14, v59, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v16, v1, v16
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v1
-; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v1
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v16, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v17, v1, v17
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v1, v18
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v1, v19
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v1, v20
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v1, v21
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v1, v22
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_mov_b32_e32 v2, v35
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v35, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v62, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v36, v58
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v51, v1, v23
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v17
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_mov_b32_e32 v4, v49
-; GCN-NEXT:    v_mov_b32_e32 v43, v42
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v42, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v39, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_mov_b32_e32 v44, v40
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v46, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:864 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v57, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v38, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v38, v1, v23
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v55
+; GCN-NEXT:    v_or_b32_e32 v58, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v59, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v62, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v63, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v55, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v61, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v0, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v5, v1, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v23
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v17, v24, v23
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v23
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v17, v24, v23
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v17, v24, v23
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v23
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v17, v24, v23
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:544 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v45, v49
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_or_b32_e32 v50, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_mov_b32_e32 v52, v36
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v39, v24, v23
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v52, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_mov_b32_e32 v45, v53
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v41, v54
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v53, v24, v23
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v53, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v55, v24, v23
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v23, v1, v23
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v24
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v25, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v24, v1, v24
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v25
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v25, v27, v25
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v1, v25
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:524 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v28, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v27, v1, v27
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:520 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v29, v28
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v28, v1, v28
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v29, v31, v29
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v29, v1, v29
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v31
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v31, v1, v31
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v17
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v32, v33, v32
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v54
+; GCN-NEXT:    v_or_b32_e32 v32, v1, v32
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v1
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v33, v1, v33
+; GCN-NEXT:    v_mov_b32_e32 v1, v47
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v17
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
 ; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v34
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:492 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v34, v36, v34
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 3, v11
+; GCN-NEXT:    v_or_b32_e32 v34, v49, v34
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 3, v36
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v36, v37, v36
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, 3, v2
+; GCN-NEXT:    v_or_b32_e32 v36, v49, v36
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 3, v44
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v37
 ; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v37, v2, v37
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, 3, v2
+; GCN-NEXT:    v_or_b32_e32 v37, v49, v37
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, 3, v15
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xff, v49
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v49, v2, v49
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:852 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v2, v40
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v54, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 3, v54
 ; GCN-NEXT:    v_and_b32_e32 v54, 0xff, v54
 ; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v54
-; GCN-NEXT:    v_or_b32_e32 v54, v63, v54
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v43
+; GCN-NEXT:    v_or_b32_e32 v54, v57, v54
+; GCN-NEXT:    v_add_i32_e32 v40, vcc, 3, v41
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xff, v40
 ; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v40, v2, v40
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v40, v41, v40
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 3, v41
 ; GCN-NEXT:    v_and_b32_e32 v41, 0xff, v41
 ; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v41
-; GCN-NEXT:    v_or_b32_e32 v41, v47, v41
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:988 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v41, v42, v41
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v2
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 3, v42
 ; GCN-NEXT:    v_and_b32_e32 v43, 0xff, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    v_or_b32_e32 v43, v44, v43
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:992 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v2
+; GCN-NEXT:    v_or_b32_e32 v43, v2, v43
+; GCN-NEXT:    v_add_i32_e32 v44, vcc, 3, v11
 ; GCN-NEXT:    v_and_b32_e32 v44, 0xff, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
 ; GCN-NEXT:    v_or_b32_e32 v44, v45, v44
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, 3, v15
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:984 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, 3, v2
 ; GCN-NEXT:    v_and_b32_e32 v45, 0xff, v45
 ; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v45
-; GCN-NEXT:    v_or_b32_e32 v45, v52, v45
+; GCN-NEXT:    v_or_b32_e32 v45, v48, v45
 ; GCN-NEXT:    v_add_i32_e32 v47, vcc, 3, v3
 ; GCN-NEXT:    v_and_b32_e32 v47, 0xff, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    v_or_b32_e32 v47, v50, v47
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, 3, v4
+; GCN-NEXT:    v_or_b32_e32 v47, v0, v47
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, 3, v0
 ; GCN-NEXT:    v_and_b32_e32 v56, 0xff, v56
 ; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v56
-; GCN-NEXT:    v_or_b32_e32 v56, v48, v56
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v56, v1, v56
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x300, v2
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x300, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    v_or_b32_e32 v61, v61, v2
+; GCN-NEXT:    v_or_b32_e32 v48, v55, v2
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v2
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, s7, v0
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v0, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v61, v2
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v3
+; GCN-NEXT:    v_add_i32_e32 v3, vcc, s7, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v5, v3
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v3, v4, v3
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v4
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, s7, v0
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v1, v4
+; GCN-NEXT:    v_or_b32_e32 v4, v5, v4
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v5
+; GCN-NEXT:    v_add_i32_e32 v5, vcc, s7, v0
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, s7, v6
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, s7, v7
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, s7, v8
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, s7, v9
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, s7, v10
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, s7, v11
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, s7, v0
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, s7, v12
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, s7, v13
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, s7, v14
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, s7, v15
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, s7, v16
+; GCN-NEXT:    v_add_i32_e32 v15, vcc, s7, v16
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, s7, v17
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, s7, v18
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, s7, v19
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v20
@@ -100507,14 +100569,14 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v22
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, s7, v26
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, s7, v30
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, s7, v35
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, s7, v51
-; GCN-NEXT:    v_add_i32_e32 v51, vcc, s7, v42
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, s7, v62
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, s7, v51
+; GCN-NEXT:    v_add_i32_e32 v51, vcc, s7, v39
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, s7, v46
-; GCN-NEXT:    v_add_i32_e32 v46, vcc, s7, v57
-; GCN-NEXT:    v_add_i32_e32 v57, vcc, s7, v38
+; GCN-NEXT:    v_add_i32_e32 v46, vcc, s7, v38
+; GCN-NEXT:    v_add_i32_e32 v57, vcc, s7, v58
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, s7, v59
-; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v62
+; GCN-NEXT:    v_add_i32_e32 v59, vcc, s7, v63
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
@@ -100536,31 +100598,29 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
-; GCN-NEXT:    v_and_b32_e32 v35, 0xffff, v35
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v51, 0xffff, v51
 ; GCN-NEXT:    v_and_b32_e32 v42, 0xffff, v42
 ; GCN-NEXT:    v_and_b32_e32 v46, 0xffff, v46
 ; GCN-NEXT:    v_and_b32_e32 v57, 0xffff, v57
 ; GCN-NEXT:    v_and_b32_e32 v58, 0xffff, v58
 ; GCN-NEXT:    v_and_b32_e32 v59, 0xffff, v59
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v38, v1
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v38, v5
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v1, v0, v1
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v38, v6
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v5, v0, v5
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v38, v7
-; GCN-NEXT:    v_or_b32_e32 v8, v39, v8
+; GCN-NEXT:    v_or_b32_e32 v6, v0, v6
+; GCN-NEXT:    v_or_b32_e32 v7, v50, v7
+; GCN-NEXT:    v_or_b32_e32 v8, v52, v8
 ; GCN-NEXT:    v_or_b32_e32 v9, v53, v9
-; GCN-NEXT:    v_or_b32_e32 v10, v55, v10
+; GCN-NEXT:    v_or_b32_e32 v10, v60, v10
 ; GCN-NEXT:    v_or_b32_e32 v11, v23, v11
 ; GCN-NEXT:    v_or_b32_e32 v12, v24, v12
-; GCN-NEXT:    v_or_b32_e32 v13, v25, v13
+; GCN-NEXT:    v_or_b32_e32 v23, v25, v13
 ; GCN-NEXT:    v_or_b32_e32 v14, v27, v14
 ; GCN-NEXT:    v_or_b32_e32 v15, v28, v15
 ; GCN-NEXT:    v_or_b32_e32 v16, v29, v16
@@ -100572,393 +100632,366 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v22, v37, v22
 ; GCN-NEXT:    v_or_b32_e32 v24, v49, v26
 ; GCN-NEXT:    v_or_b32_e32 v25, v54, v30
-; GCN-NEXT:    v_or_b32_e32 v26, v40, v35
+; GCN-NEXT:    v_or_b32_e32 v26, v40, v38
 ; GCN-NEXT:    v_or_b32_e32 v28, v41, v51
 ; GCN-NEXT:    v_or_b32_e32 v30, v43, v42
-; GCN-NEXT:    v_or_b32_e32 v33, v44, v46
+; GCN-NEXT:    v_or_b32_e32 v31, v44, v46
 ; GCN-NEXT:    v_or_b32_e32 v34, v45, v57
-; GCN-NEXT:    v_or_b32_e32 v38, v47, v58
-; GCN-NEXT:    v_or_b32_e32 v39, v56, v59
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, s6, v61
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, s6, v2
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, s6, v3
+; GCN-NEXT:    v_or_b32_e32 v27, v47, v58
+; GCN-NEXT:    v_or_b32_e32 v36, v56, v59
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, s6, v48
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v2
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, s6, v3
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s6, v4
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, s6, v1
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, s6, v5
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, s6, v6
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, s6, v7
-; GCN-NEXT:    v_add_i32_e32 v51, vcc, s6, v8
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, s6, v9
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, s6, v10
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, s6, v1
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, s6, v5
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, s6, v6
+; GCN-NEXT:    v_add_i32_e32 v13, vcc, s6, v7
+; GCN-NEXT:    v_add_i32_e32 v48, vcc, s6, v8
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, s6, v9
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, s6, v10
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, s6, v11
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, s6, v12
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, s6, v13
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v14
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, s6, v12
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, s6, v23
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, s6, v14
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s6, v15
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, s6, v16
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, s6, v17
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, s6, v16
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, s6, v17
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, s6, v18
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, s6, v19
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, s6, v20
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, s6, v21
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, s6, v22
+; GCN-NEXT:    v_add_i32_e32 v60, vcc, s6, v19
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, s6, v20
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, s6, v21
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, s6, v22
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, s6, v24
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, s6, v25
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, s6, v25
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s6, v26
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, s6, v28
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s6, v30
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, s6, v33
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, s6, v34
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, s6, v38
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v39
-; GCN-NEXT:    v_alignbit_b32 v61, v1, v16, 16
-; GCN-NEXT:    v_alignbit_b32 v33, v31, v56, 16
-; GCN-NEXT:    v_alignbit_b32 v59, v25, v18, 16
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:984 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v29, v4, v8, 16
-; GCN-NEXT:    v_alignbit_b32 v0, v5, v21, 16
-; GCN-NEXT:    v_alignbit_b32 v2, v6, v13, 16
-; GCN-NEXT:    v_alignbit_b32 v19, v7, v23, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:956 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:980 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, s6, v28
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, s6, v30
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v31
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v34
+; GCN-NEXT:    v_add_i32_e32 v9, vcc, s6, v27
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v36
+; GCN-NEXT:    v_alignbit_b32 v6, v1, v9, 16
+; GCN-NEXT:    v_alignbit_b32 v63, v2, v12, 16
+; GCN-NEXT:    v_alignbit_b32 v27, v20, v8, 16
+; GCN-NEXT:    v_alignbit_b32 v41, v4, v17, 16
+; GCN-NEXT:    v_alignbit_b32 v22, v5, v18, 16
+; GCN-NEXT:    v_alignbit_b32 v25, v19, v14, 16
+; GCN-NEXT:    v_alignbit_b32 v21, v60, v23, 16
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:980 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v56, v49, v16, 16
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v16, v21
+; GCN-NEXT:    buffer_store_dword v53, off, s[0:3], s32 offset:856 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v15, v53, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v52, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v17, v14, 16
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v14, v2
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v19, v15, v12, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v53, off, s[0:3], s32 offset:844 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v10, v52, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:936 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v50, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v10, v53, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v52, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v11, v50, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v11, v52, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v51, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v33, v48, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v39, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v48, v51, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v50, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v13, v39, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v38, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v36, v50, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v49, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v29, v38, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v37, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v35, v49, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v3, v37, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v3, v32, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v37, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v21, v0, v32, 16
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v19, v27, v37, 16
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v1
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v2
+; GCN-NEXT:    v_mov_b32_e32 v55, v20
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v1
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:976 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v20
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v4
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:976 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v31
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v25
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v5
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v61, v19
+; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v19
 ; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v60
 ; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:964 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v50, v49
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v6
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:960 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v7
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v62, v17
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v37, v15
+; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v49
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v51, v15
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:940 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v10
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v34, v11
+; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v62, v11
 ; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v58, v48
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v7, v33
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v48
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v52, v36
+; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v33
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v36
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v46, v35
+; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v13
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v53, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v35
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v41, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v29
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v43, v3
 ; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:996 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v27
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:944 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v46, v0
+; GCN-NEXT:    v_lshrrev_b32_e32 v0, 16, v0
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:952 ; 4-byte Folded Spill
 ; GCN-NEXT:  .LBB48_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v61
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v19
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
+; GCN-NEXT:    v_or_b32_e32 v6, v9, v6
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v32
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v9
+; GCN-NEXT:    buffer_store_dword v6, v35, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:976 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v19
-; GCN-NEXT:    buffer_store_dword v8, v60, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 4, v60
-; GCN-NEXT:    buffer_store_dword v1, v8, s[0:3], 0 offen
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 4, v35
+; GCN-NEXT:    buffer_store_dword v1, v6, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v33
-; GCN-NEXT:    v_or_b32_e32 v56, v1, v8
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v31
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v44, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, 8, v60
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v18
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v59
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v63
+; GCN-NEXT:    v_or_b32_e32 v45, v1, v6
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v31
 ; GCN-NEXT:    v_or_b32_e32 v63, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 12, v60
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v59, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 16, v60
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:984 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v29
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 8, v35
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v27
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 20, v60
+; GCN-NEXT:    v_add_i32_e32 v44, vcc, 12, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v55
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v57, v2, v3
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 16, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v41
+; GCN-NEXT:    v_or_b32_e32 v12, v2, v3
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 20, v35
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v4
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:976 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, 24, v60
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v21
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v0
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 28, v60
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 24, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v18
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v0, v2, v3
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 28, v35
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v5
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, 32, v60
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v13
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v61, v2, v3
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, 36, v60
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v6
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:960 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:972 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v0
 ; GCN-NEXT:    v_or_b32_e32 v47, v2, v3
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 40, v60
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v23
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:956 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, 44, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v7
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v23, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 48, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:980 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v15, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 52, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v62
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v57, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 56, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v12
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v21, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 60, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v37
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v6, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 64, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:844 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v25, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 0x44, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v10
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v9, vcc, 32, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v25
+; GCN-NEXT:    v_or_b32_e32 v55, v2, v3
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 36, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v61
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:968 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v27, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 0x48, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v58, v2, v3
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, 40, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v16
+; GCN-NEXT:    v_or_b32_e32 v3, v2, v3
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 44, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v60
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:964 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v60, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 48, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:980 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v56
+; GCN-NEXT:    v_or_b32_e32 v15, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 52, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v50
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:948 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v59, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 56, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:856 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v22, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 60, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v51
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:940 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v56, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 64, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:844 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:936 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v30, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 0x4c, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v34
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v26, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x44, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v10
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:932 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v32, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 0x50, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v28, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 0x48, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:840 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:928 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v34, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, 0x54, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v58
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v31, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x4c, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v62
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:924 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v29, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x58, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v33, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 0x50, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:920 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v37, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, 0x5c, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v52
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v10, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x54, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v7
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:916 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v5, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x60, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v29, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 0x58, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:912 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v48, v3, v7
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x64, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v46
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v38, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x5c, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v13
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:908 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v50, v3, v7
-; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x68, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v48, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x60, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:884 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:904 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v55, v3, v7
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x6c, v60
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v41
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:880 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v50, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x64, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v53
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:900 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v54, v3, v7
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x70, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v5, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, 0x68, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:876 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:896 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v41, v3, v7
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x74, v60
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:996 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v40, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x6c, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v43
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:892 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:944 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v54, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x70, v35
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v52, v3, v7
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x78, v60
-; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v60
-; GCN-NEXT:    buffer_store_dword v56, v45, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v44, v19, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v63, v16, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v59, v1, s[0:3], 0 offen
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v0
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:888 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v13, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x74, v35
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v46
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:952 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v52, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x78, v35
+; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v35
+; GCN-NEXT:    buffer_store_dword v45, v24, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v63, v44, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v1, v20, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v57, v17, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v12, v6, s[0:3], 0 offen
+; GCN-NEXT:    s_waitcnt expcnt(2)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, v4, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v1, v8, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v1, v9, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    buffer_store_dword v1, v11, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v61, v13, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v47, v14, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v2, v17, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v23, v18, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v15, v20, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v57, v22, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v21, v24, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v6, v26, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v25, v28, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v27, v31, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v30, v33, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v32, v35, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v34, v36, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v29, v38, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v37, v39, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v5, v49, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v47, v11, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v55, v14, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v58, v16, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v18, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v60, v19, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v15, v21, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v59, v23, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v25, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v56, v27, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v26, v30, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v32, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v31, v34, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v33, v36, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v10, v37, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v29, v39, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v38, v49, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v48, v51, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v50, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v55, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v50, v7, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v5, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v40, v42, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v54, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v41, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v13, v53, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v52, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
@@ -100974,8 +101007,8 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(3)
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
@@ -101032,39 +101065,39 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:136
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:144
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:152
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:136
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:144
+; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:152
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:160
 ; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:168
 ; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:176
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:184
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v25
 ; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v1, 8, v29
-; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v3
 ; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v5
 ; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v7
-; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v9
 ; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
-; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v13
-; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v15
+; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v15
 ; VI-NEXT:    v_lshlrev_b16_e32 v35, 8, v17
 ; VI-NEXT:    v_lshlrev_b16_e32 v36, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v21
 ; VI-NEXT:    v_lshlrev_b16_e32 v34, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v27
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
@@ -101078,46 +101111,47 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v27
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
 ; VI-NEXT:    ; implicit-def: $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v37
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v38
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v39
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v48
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v49
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
-; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v50
+; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v51
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:520 ; 4-byte Folded Spill
@@ -101134,7 +101168,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v43
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:208
@@ -101142,15 +101176,15 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
@@ -101173,20 +101207,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -101199,20 +101233,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:696 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -101220,17 +101254,17 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:636 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
@@ -101246,17 +101280,17 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:316
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:804 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:324
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:820 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; VI-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
@@ -101271,44 +101305,44 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_ushort v3, off, s[0:3], s32 offset:376
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:348
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v1
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v1
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v2
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v3
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:372
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; VI-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
 ; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
 ; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -101334,31 +101368,31 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v49 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v4, v4, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr54
 ; VI-NEXT:    ; implicit-def: $vgpr55
 ; VI-NEXT:    ; implicit-def: $vgpr40
 ; VI-NEXT:    ; implicit-def: $vgpr41
-; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr36
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v34 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr34
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v52 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr50
 ; VI-NEXT:    ; implicit-def: $vgpr52
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v37 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr37
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -101377,35 +101411,35 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_or_b32_sdwa v10, v63, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v61, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_or_b32_sdwa v12, v58, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v59, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_or_b32_sdwa v14, v45, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    ; implicit-def: $vgpr45
 ; VI-NEXT:    s_waitcnt vmcnt(2)
@@ -101414,26 +101448,26 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v63, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr62
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v60, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr59
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v10, v57, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v56, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr56
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v58, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr56
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -101442,35 +101476,35 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v13, v44, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -101485,20 +101519,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:532 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -101506,189 +101540,189 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v28, v28, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v31, v31, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    v_or_b32_sdwa v31, v31, v49 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr49
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v32, v32, v53 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v30, v30, v37 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v30, v30, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
-; VI-NEXT:    ; kill: killed $vgpr37
-; VI-NEXT:    ; implicit-def: $vgpr37
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; kill: killed $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v30, v30, v38 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v30, v30, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v30, v30, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr39
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v31, v31, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v31, v31, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v31, v31, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    ; kill: killed $vgpr32
@@ -101793,7 +101827,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    ; kill: killed $vgpr32
 ; VI-NEXT:    ; implicit-def: $vgpr32
-; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr51
 ; VI-NEXT:  .LBB48_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB48_4
@@ -101801,53 +101835,51 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v18, 0x300
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
 ; VI-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    v_add_u16_sdwa v4, v0, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(12)
 ; VI-NEXT:    v_or_b32_sdwa v29, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(13)
+; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v4, v0, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_sdwa v0, v2, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v29, 0x300, v29
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v2, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v2, v0
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v2, 3, v2
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v2, 3, v2
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
 ; VI-NEXT:    v_or_b32_sdwa v2, v52, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v3, v51, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v50, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v2, v2, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v2, v3, v2
@@ -101857,19 +101889,19 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v1, 0x300, v1
 ; VI-NEXT:    v_or_b32_e32 v1, v1, v4
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v3, v49, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v48, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v3, v3, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
-; VI-NEXT:    v_or_b32_sdwa v4, v39, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v4, v37, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v4, 0x300, v4
 ; VI-NEXT:    v_or_b32_e32 v3, v4, v3
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
@@ -101877,14 +101909,13 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
 ; VI-NEXT:    v_or_b32_sdwa v6, v33, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v6, 0x300, v6
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    s_waitcnt vmcnt(2)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v5, 3, v5
 ; VI-NEXT:    v_or_b32_sdwa v5, v35, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v5, 0x300, v5
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v4, 3, v4
 ; VI-NEXT:    v_or_b32_sdwa v4, v36, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v4, v4, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
@@ -101901,7 +101932,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v6, v32, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_or_b32_sdwa v32, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_sdwa v32, v32, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
@@ -101909,7 +101940,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v28, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v28, 0x300, v28
 ; VI-NEXT:    v_or_b32_e32 v28, v28, v32
 ; VI-NEXT:    s_waitcnt vmcnt(1)
@@ -101917,78 +101948,94 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v33, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v33, v33, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v27, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:536 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v27, 0x300, v27
 ; VI-NEXT:    v_or_b32_e32 v27, v27, v33
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v34, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v34, v34, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v26, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v26, 0x300, v26
 ; VI-NEXT:    v_or_b32_e32 v26, v26, v34
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v35, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v35, v35, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v25, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v25, 0x300, v25
+; VI-NEXT:    v_or_b32_e32 v25, v25, v35
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v36, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v6, v7, v6
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v25, 0x300, v25
-; VI-NEXT:    v_or_b32_e32 v25, v25, v35
+; VI-NEXT:    v_add_u16_sdwa v36, v36, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v36, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_sdwa v36, v36, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v24, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
+; VI-NEXT:    v_or_b32_e32 v24, v24, v36
+; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v37, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v37, v37, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    v_or_b32_sdwa v24, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_sdwa v23, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
 ; VI-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v24, 0x300, v24
-; VI-NEXT:    v_or_b32_e32 v24, v24, v36
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
-; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
+; VI-NEXT:    v_or_b32_e32 v23, v23, v37
 ; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    v_add_u16_e32 v22, 3, v22
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
 ; VI-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v7, v8, v7
-; VI-NEXT:    v_add_u16_e32 v8, 3, v61
+; VI-NEXT:    v_add_u16_e32 v8, 3, v63
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 3, v62
@@ -101997,30 +102044,30 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v8, v9, v8
-; VI-NEXT:    v_add_u16_e32 v9, 3, v63
+; VI-NEXT:    v_add_u16_e32 v9, 3, v61
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v10, 3, v59
+; VI-NEXT:    v_add_u16_e32 v10, 3, v60
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v9, v10, v9
-; VI-NEXT:    v_add_u16_e32 v10, 3, v60
+; VI-NEXT:    v_add_u16_e32 v10, 3, v57
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v11, 3, v57
+; VI-NEXT:    v_add_u16_e32 v11, 3, v56
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v10, v11, v10
-; VI-NEXT:    v_add_u16_e32 v11, 3, v58
+; VI-NEXT:    v_add_u16_e32 v11, 3, v59
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v12, 3, v56
+; VI-NEXT:    v_add_u16_e32 v12, 3, v58
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v11, v12, v11
@@ -102029,7 +102076,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v12, v13, v12
@@ -102038,7 +102085,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 3, v44
 ; VI-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v13, v14, v13
@@ -102050,35 +102097,35 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; VI-NEXT:    v_or_b32_e32 v14, v15, v14
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v15, 3, v15
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v15, v15, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
 ; VI-NEXT:    v_or_b32_sdwa v17, v19, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
 ; VI-NEXT:    v_or_b32_sdwa v20, v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -102087,54 +102134,43 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v16, v19, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v19, 0x300, v20
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; VI-NEXT:    v_or_b32_e32 v16, v19, v16
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
-; VI-NEXT:    v_or_b32_sdwa v30, v38, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v30, v39, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v30, 0x300, v30
-; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v20, 3, v20
-; VI-NEXT:    v_or_b32_sdwa v31, v50, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v31, v51, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v31, 0x300, v31
-; VI-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v21, 3, v21
-; VI-NEXT:    v_or_b32_sdwa v21, v37, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v37, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
+; VI-NEXT:    v_or_b32_sdwa v21, v38, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v40, v21, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_sdwa v37, v37, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_e32 v29, v29, v40
-; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v23, v23, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v23, 0x300, v23
-; VI-NEXT:    v_or_b32_e32 v23, v23, v37
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v38, v38, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_e32 v20, 3, v20
-; VI-NEXT:    v_or_b32_sdwa v20, v48, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
-; VI-NEXT:    v_add_u16_sdwa v55, v20, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_sdwa v38, v38, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_e32 v30, v30, v55
-; VI-NEXT:    s_waitcnt vmcnt(1)
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v22, 3, v22
 ; VI-NEXT:    v_or_b32_sdwa v22, v39, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v20, 3, v20
+; VI-NEXT:    v_or_b32_sdwa v20, v49, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_sdwa v55, v20, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v22, 0x300, v22
 ; VI-NEXT:    v_or_b32_e32 v22, v22, v38
-; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_e32 v30, v30, v55
+; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v39, 3, v39
 ; VI-NEXT:    v_or_b32_sdwa v39, v48, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:552 ; 4-byte Folded Reload
@@ -102156,7 +102192,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
 ; VI-NEXT:    v_add_u16_e32 v19, 3, v19
 ; VI-NEXT:    v_or_b32_sdwa v19, v53, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; VI-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v39, 3, v39
 ; VI-NEXT:    v_or_b32_sdwa v39, v49, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -102258,17 +102294,16 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:112
 ; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
 ; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:136
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:136
 ; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:144
 ; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:152
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:160
 ; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:168
 ; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:176
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:184
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v27
-; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v1, 8, v29
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v3
@@ -102276,81 +102311,81 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v7
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v11
-; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v13
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v15
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v17
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v35, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v21
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v23
-; GFX9-NEXT:    s_waitcnt vmcnt(27)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v25
 ; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v30
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:828 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:812 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v4
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:808 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v6
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v8
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v10
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v12
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v14
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v16
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v18
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v20
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v22
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v24
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v26
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; GFX9-NEXT:    s_waitcnt vmcnt(26)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:132
-; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v28
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:140
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v31
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
-; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v25
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    ; implicit-def: $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:624 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:140
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:148
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v37
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:720 ; 4-byte Folded Spill
-; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v39
+; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v48
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:680 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v49
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:672 ; 4-byte Folded Spill
@@ -102358,7 +102393,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:688 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:156
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:164
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
@@ -102375,7 +102410,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v43
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v44
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:584 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:580 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:192
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:200
@@ -102384,15 +102419,15 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:188
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:620 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:540 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:196
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:612 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:628 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
@@ -102416,20 +102451,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:556 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:228
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:660 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:572 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:564 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:236
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:560 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:244
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:568 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:256
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:264
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:272
@@ -102443,20 +102478,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:592 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:260
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:700 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:268
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:600 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:596 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:276
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:608 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:288
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:296
 ; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:304
@@ -102465,17 +102500,17 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:284
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v1
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:632 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:292
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:776 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:640 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:300
@@ -102502,7 +102537,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v2
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:816 ; 4-byte Folded Spill
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v0, 8, v3
-; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:836 ; 4-byte Folded Spill
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:676 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:332
@@ -102522,48 +102557,48 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v1
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v3
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:356
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:364
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:372
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:384
 ; GFX9-NEXT:    buffer_load_ushort v1, off, s[0:3], s32 offset:380
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:100
 ; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:92
 ; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v56, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v59, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB48_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:488 ; 4-byte Folded Reload
@@ -102572,9 +102607,9 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s6
@@ -102603,10 +102638,10 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; implicit-def: $vgpr51
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v3, v4, v3, s6
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:484 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr48
+; GFX9-NEXT:    ; implicit-def: $vgpr39
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v36 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
@@ -102617,7 +102652,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v5, v6, v5, s6
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr34
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -102632,93 +102667,93 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v7, v8, v7, s6
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:812 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v60, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v60, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v56, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v56, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v58, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v46, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v44, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v15, v43, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr43
+; GFX9-NEXT:    v_or_b32_sdwa v15, v42, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v63, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v62, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v59, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v61, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr59
+; GFX9-NEXT:    ; implicit-def: $vgpr61
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v58, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v57, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v11, v57, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v59, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v12, v46, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v47, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr46
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    ; implicit-def: $vgpr47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v45, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v43, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr42
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:680 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v16, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v17, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v16, v17, v16, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v18, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -102733,20 +102768,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v18, v19, v18, s6
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v19, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v19, v20, v19, s6
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v20, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v20, v21, v20, s6
@@ -102754,58 +102789,58 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:628 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v21, v21, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v21, v22, v21, s6
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v22, v22, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v22, v23, v22, s6
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v23, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v23, v24, v23, s6
-; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:704 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v24, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v24, v25, v24, s6
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v25, v25, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v25, v26, v25, s6
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v26, v26, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v26, v27, v26, s6
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:644 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v27, v27, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:664 ; 4-byte Folded Reload
@@ -102823,22 +102858,22 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v28, v29, v28, s6
 ; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v29, v29, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v32, v32, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr53
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v37 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v29, v30, v29, s6
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr37
 ; GFX9-NEXT:    ; kill: killed $vgpr37
 ; GFX9-NEXT:    ; implicit-def: $vgpr37
@@ -102935,7 +102970,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v30, v30, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v30, v31, v30, s6
-; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr49
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v31, v31, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
@@ -103047,27 +103082,27 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_cbranch_execz .LBB48_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:676 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:512 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:672 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:504 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:788 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:820 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:508 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:776 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:836 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:816 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:804 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:684 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:832 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:808 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:828 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:796 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:720 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:696 ; 4-byte Folded Reload
@@ -103075,19 +103110,20 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
+; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v55, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
+; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v2
+; GFX9-NEXT:    s_waitcnt vmcnt(17)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
 ; GFX9-NEXT:    v_or_b32_sdwa v3, v54, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v2
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v3
-; GFX9-NEXT:    v_add_u16_e32 v24, 3, v24
 ; GFX9-NEXT:    v_perm_b32 v0, v2, v0, s6
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:500 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(17)
+; GFX9-NEXT:    s_waitcnt vmcnt(16)
 ; GFX9-NEXT:    v_or_b32_sdwa v24, v25, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:668 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:496 ; 4-byte Folded Reload
@@ -103112,7 +103148,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:480 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v3, 3, v3
-; GFX9-NEXT:    v_or_b32_sdwa v3, v48, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v3, v39, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v4, 3, v4
@@ -103138,7 +103174,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:476 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_or_b32_sdwa v36, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:640 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:472 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 3, v22
@@ -103162,11 +103198,11 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_or_b32_sdwa v23, v37, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v37, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:760 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:636 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:764 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 3, v21
-; GFX9-NEXT:    v_or_b32_sdwa v21, v39, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v21, v48, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v33, 0x300, v21
 ; GFX9-NEXT:    v_add_u16_e32 v34, 0x300, v23
 ; GFX9-NEXT:    v_perm_b32 v29, v34, v29, s6
@@ -103174,16 +103210,17 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v32, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v6, 0x300, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_perm_b32 v6, v7, v6, s6
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v38, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:708 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:824 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
@@ -103192,17 +103229,18 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v39, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:632 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v8, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v39
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v48, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:604 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -103215,45 +103253,45 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v62
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:800 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v59
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v61
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v10, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v60
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:792 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v58
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v57
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v11, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v60
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v56
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:784 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:780 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v57
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v59
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v12, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v56
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v58
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:772 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:768 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v46
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v47
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v13, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v46
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:752 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:748 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
@@ -103262,63 +103300,63 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v14, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 3, v44
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v43
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v15, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v15, 3, v43
+; GFX9-NEXT:    v_add_u16_e32 v15, 3, v42
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:716 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v16, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:624 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:688 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v17, 3, v17
 ; GFX9-NEXT:    v_or_b32_sdwa v17, v18, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v17, 0x300, v17
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v19, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v20, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v16
 ; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v18
 ; GFX9-NEXT:    v_perm_b32 v17, v17, v20, s6
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:728 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:736 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v19
 ; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:756 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v16, v18, v16, s6
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:740 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:744 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v20, 3, v20
 ; GFX9-NEXT:    v_or_b32_sdwa v20, v49, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_or_b32_sdwa v49, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:600 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:596 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:700 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v20
 ; GFX9-NEXT:    v_perm_b32 v30, v33, v30, s6
@@ -103326,7 +103364,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v50, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:576 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:660 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v18, 3, v18
 ; GFX9-NEXT:    v_or_b32_sdwa v18, v52, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -103336,14 +103374,14 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v51, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:592 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:588 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:692 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v23, 0x300, v51
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v52, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:572 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:656 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 3, v19
 ; GFX9-NEXT:    v_or_b32_sdwa v19, v53, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
@@ -103353,7 +103391,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v53, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:568 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:560 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:652 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v53
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -103366,7 +103404,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v55, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:564 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:556 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:648 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v55
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -103374,7 +103412,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v40, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:548 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:620 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v50, 0x300, v40
 ; GFX9-NEXT:    v_perm_b32 v21, v50, v21, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -103382,14 +103420,14 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v41, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:544 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:616 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v41
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v42, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:528 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:584 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:580 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v51, 0x300, v42
 ; GFX9-NEXT:    v_perm_b32 v20, v51, v20, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
@@ -103397,7 +103435,7 @@ define <64 x i16> @bitcast_v128i8_to_v64i16(<128 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v43, v28, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:540 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:612 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:608 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_add_u16_e32 v19, 0x300, v43
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v27, 3, v27
@@ -105176,19 +105214,10 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(5)
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:136
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:64
 ; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:60
@@ -105199,113 +105228,98 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:40
 ; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:32
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:8
+; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v1
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v4
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v6, 1.0, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v9
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v18, 1.0, v18
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v26
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v29
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v24, 1.0, v24
+; GCN-NEXT:    v_mul_f32_e32 v25, 1.0, v25
+; GCN-NEXT:    v_mul_f32_e32 v26, 1.0, v26
+; GCN-NEXT:    v_mul_f32_e32 v27, 1.0, v27
+; GCN-NEXT:    v_mul_f32_e32 v28, 1.0, v28
+; GCN-NEXT:    v_mul_f32_e32 v29, 1.0, v29
+; GCN-NEXT:    v_mul_f32_e32 v30, 1.0, v30
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:120
-; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v32
-; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v33
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:136
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v34
 ; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
 ; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v36
-; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v3
-; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v2
-; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v37
+; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v38
+; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
 ; GCN-NEXT:    v_mul_f32_e32 v48, 1.0, v48
 ; GCN-NEXT:    v_mul_f32_e32 v49, 1.0, v49
 ; GCN-NEXT:    v_mul_f32_e32 v50, 1.0, v50
@@ -105318,28 +105332,33 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v41
 ; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v42
 ; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v43
-; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v44
-; GCN-NEXT:    v_mul_f32_e32 v45, 1.0, v45
-; GCN-NEXT:    v_mul_f32_e32 v46, 1.0, v46
-; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v47
-; GCN-NEXT:    v_mul_f32_e32 v56, 1.0, v56
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:112
-; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v57
-; GCN-NEXT:    v_mul_f32_e32 v58, 1.0, v58
+; GCN-NEXT:    v_mul_f32_e32 v44, 1.0, v32
+; GCN-NEXT:    v_mul_f32_e32 v45, 1.0, v31
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_mul_f32_e32 v46, 1.0, v1
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_mul_f32_e32 v47, 1.0, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_mul_f32_e32 v59, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v56, 1.0, v3
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v2
-; GCN-NEXT:    v_mul_f32_e32 v61, 1.0, v61
+; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v10
+; GCN-NEXT:    v_mul_f32_e32 v58, 1.0, v4
+; GCN-NEXT:    v_mul_f32_e32 v59, 1.0, v5
+; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v7
+; GCN-NEXT:    v_mul_f32_e32 v61, 1.0, v8
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:128
 ; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
-; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v62
+; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v9
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v2
+; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
@@ -105350,47 +105369,26 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr9
-; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr11
-; GCN-NEXT:    ; implicit-def: $vgpr7
-; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; implicit-def: $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr12
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr23
-; GCN-NEXT:    ; implicit-def: $vgpr18
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr24
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr9
+; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr28
+; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr8
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
@@ -105413,6 +105411,25 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
@@ -105439,110 +105456,108 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr14
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr15
 ; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr31
+; GCN-NEXT:    ; implicit-def: $vgpr20
+; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr21
+; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr23
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB49_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v33
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
-; GCN-NEXT:    v_mov_b32_e32 v5, v4
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v7, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v6
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v12, 16, v6
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v14, 16, v6
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v15, 16, v6
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v16, 16, v6
+; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v13
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v6
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v6
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v6
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v31, 16, v18
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v4
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v4
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v13, 16, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v22
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v25
+; GCN-NEXT:    v_lshrrev_b32_e32 v33, 16, v26
+; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v27
+; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v28
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v29
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v30
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v4
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v37, v36
+; GCN-NEXT:    v_mov_b32_e32 v36, v35
+; GCN-NEXT:    v_mov_b32_e32 v35, v34
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    v_lshrrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    v_lshrrev_b32_e32 v34, 16, v34
+; GCN-NEXT:    v_lshrrev_b32_e32 v34, 16, v4
 ; GCN-NEXT:    v_lshrrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
 ; GCN-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
 ; GCN-NEXT:    v_lshrrev_b32_e32 v38, 16, v38
 ; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v39
 ; GCN-NEXT:    v_lshrrev_b32_e32 v48, 16, v48
@@ -105568,163 +105583,169 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v60, 16, v60
 ; GCN-NEXT:    v_lshrrev_b32_e32 v61, 16, v61
 ; GCN-NEXT:    v_lshrrev_b32_e32 v62, 16, v62
-; GCN-NEXT:    v_lshrrev_b32_e32 v63, 16, v63
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_lshrrev_b32_e32 v63, 16, v63
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
-; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
-; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v5
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v6
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v7
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v8
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v9
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v10
-; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v11
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v12
-; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v13
-; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v15
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v16
-; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v17
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v18
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v19
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v21
-; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v22
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v23
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v24
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v25
-; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v27
-; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v28
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v3
-; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v32
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v10
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v12
+; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v14
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v15
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v16
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v17
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v19
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v20
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v21
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v23
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v31
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v32
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v6
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v13
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v18
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v22
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v24
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v25
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v33
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v34
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v35
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v36
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v26
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v37
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v27
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v29
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v30
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v34
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v35
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v36
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v37
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v37, v3
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v11
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v38
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v39
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v48
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v49
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v50
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v51
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v52
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v53
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v54
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v55
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v40
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v41
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v42
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v43
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v44
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v45
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v45
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v46
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v47
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v47
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v56
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v57
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v58
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v59
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v60
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v61
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v62
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v63
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v2
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr4
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v57
+; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v58
+; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v59
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v60
+; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v61
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v62
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v63
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v2
+; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
@@ -105733,6 +105754,7 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr6
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
@@ -105745,6 +105767,7 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr13
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
@@ -105753,24 +105776,31 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr18
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr22
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr24
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr27
+; GCN-NEXT:    ; implicit-def: $vgpr28
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr30
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr37
+; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr38
 ; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:    ; implicit-def: $vgpr48
@@ -105796,226 +105826,248 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr62
-; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr2
 ; GCN-NEXT:  .LBB49_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB49_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
-; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v63
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v62
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v1
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v62
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v61
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v60
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v59
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v58
-; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v57
-; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v56
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v47
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v46
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v45
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v44
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v43
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v42
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v41
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v40
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v55
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v54
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v53
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v52
-; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v51
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v50
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v49
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v48
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v39
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v38
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v37
+; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v57
+; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v56
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v47
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v46
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v45
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v44
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v43
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v42
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v41
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v40
+; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v55
+; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v54
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v53
+; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v52
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v51
+; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v50
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v49
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v48
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v39
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v38
+; GCN-NEXT:    v_and_b32_e32 v54, 0xffff0000, v11
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v36
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v35
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v34
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v33
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v32
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v54, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v55, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v41, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v42, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v41, 0xffff0000, v30
+; GCN-NEXT:    v_and_b32_e32 v42, 0xffff0000, v29
+; GCN-NEXT:    v_and_b32_e32 v43, 0xffff0000, v28
+; GCN-NEXT:    v_and_b32_e32 v44, 0xffff0000, v27
+; GCN-NEXT:    v_and_b32_e32 v45, 0xffff0000, v26
+; GCN-NEXT:    v_and_b32_e32 v46, 0xffff0000, v25
+; GCN-NEXT:    v_and_b32_e32 v47, 0xffff0000, v24
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v43, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v56, 0xffff0000, v1
+; GCN-NEXT:    v_and_b32_e32 v57, 0xffff0000, v22
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v44, 0xffff0000, v1
+; GCN-NEXT:    v_and_b32_e32 v58, 0xffff0000, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v45, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v59, 0xffff0000, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v46, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v60, 0xffff0000, v1
+; GCN-NEXT:    v_and_b32_e32 v61, 0xffff0000, v18
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v47, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v62, 0xffff0000, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v56, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v57, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v58, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v59, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v60, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v61, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v62, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v63, 0xffff0000, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v6
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v33
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v2
 ; GCN-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
-; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v5
-; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v6
+; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v4
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v7
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v8
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v9
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v10
-; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v11
-; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v12
-; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v13
-; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v14
-; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v15
-; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v16
-; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v17
-; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v18
-; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v19
-; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v20
-; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v21
-; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v22
-; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v23
-; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v24
-; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v25
-; GCN-NEXT:    v_add_f32_e32 v25, 0x40c00000, v26
-; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v27
-; GCN-NEXT:    v_add_f32_e32 v27, 0x40c00000, v28
-; GCN-NEXT:    v_add_f32_e32 v28, 0x40c00000, v29
-; GCN-NEXT:    v_add_f32_e32 v29, 0x40c00000, v30
-; GCN-NEXT:    v_add_f32_e32 v30, 0x40c00000, v31
+; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v12
+; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v14
+; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v15
+; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v16
+; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v17
+; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v19
+; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v20
+; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v21
+; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v31
+; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v32
+; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v37
+; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v53
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v52
+; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v51
+; GCN-NEXT:    v_add_f32_e32 v25, 0x40c00000, v50
+; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v49
+; GCN-NEXT:    v_add_f32_e32 v27, 0x40c00000, v48
+; GCN-NEXT:    v_add_f32_e32 v28, 0x40c00000, v39
+; GCN-NEXT:    v_add_f32_e32 v29, 0x40c00000, v38
+; GCN-NEXT:    v_add_f32_e32 v30, 0x40c00000, v54
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x40c00000, v36
 ; GCN-NEXT:    v_add_f32_e32 v32, 0x40c00000, v35
 ; GCN-NEXT:    v_add_f32_e32 v33, 0x40c00000, v34
-; GCN-NEXT:    v_add_f32_e32 v34, 0x40c00000, v37
-; GCN-NEXT:    v_add_f32_e32 v35, 0x40c00000, v38
-; GCN-NEXT:    v_add_f32_e32 v36, 0x40c00000, v39
-; GCN-NEXT:    v_add_f32_e32 v37, 0x40c00000, v48
-; GCN-NEXT:    v_add_f32_e32 v38, 0x40c00000, v49
-; GCN-NEXT:    v_add_f32_e32 v39, 0x40c00000, v50
-; GCN-NEXT:    v_add_f32_e32 v48, 0x40c00000, v51
-; GCN-NEXT:    v_add_f32_e32 v49, 0x40c00000, v52
-; GCN-NEXT:    v_add_f32_e32 v50, 0x40c00000, v53
-; GCN-NEXT:    v_add_f32_e32 v51, 0x40c00000, v54
-; GCN-NEXT:    v_add_f32_e32 v52, 0x40c00000, v55
-; GCN-NEXT:    v_add_f32_e32 v53, 0x40c00000, v40
-; GCN-NEXT:    v_add_f32_e32 v54, 0x40c00000, v41
-; GCN-NEXT:    v_add_f32_e32 v55, 0x40c00000, v42
-; GCN-NEXT:    v_add_f32_e32 v40, 0x40c00000, v43
-; GCN-NEXT:    v_add_f32_e32 v41, 0x40c00000, v44
-; GCN-NEXT:    v_add_f32_e32 v42, 0x40c00000, v45
-; GCN-NEXT:    v_add_f32_e32 v43, 0x40c00000, v46
-; GCN-NEXT:    v_add_f32_e32 v44, 0x40c00000, v47
-; GCN-NEXT:    v_add_f32_e32 v45, 0x40c00000, v56
-; GCN-NEXT:    v_add_f32_e32 v46, 0x40c00000, v57
-; GCN-NEXT:    v_add_f32_e32 v47, 0x40c00000, v58
-; GCN-NEXT:    v_add_f32_e32 v56, 0x40c00000, v59
-; GCN-NEXT:    v_add_f32_e32 v57, 0x40c00000, v60
-; GCN-NEXT:    v_add_f32_e32 v58, 0x40c00000, v61
-; GCN-NEXT:    v_add_f32_e32 v59, 0x40c00000, v62
-; GCN-NEXT:    v_add_f32_e32 v60, 0x40c00000, v63
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_f32_e32 v34, 0x40c00000, v55
+; GCN-NEXT:    v_add_f32_e32 v35, 0x40c00000, v40
+; GCN-NEXT:    v_add_f32_e32 v36, 0x40c00000, v41
+; GCN-NEXT:    v_add_f32_e32 v37, 0x40c00000, v42
+; GCN-NEXT:    v_add_f32_e32 v38, 0x40c00000, v43
+; GCN-NEXT:    v_add_f32_e32 v39, 0x40c00000, v44
+; GCN-NEXT:    v_add_f32_e32 v48, 0x40c00000, v45
+; GCN-NEXT:    v_add_f32_e32 v49, 0x40c00000, v46
+; GCN-NEXT:    v_add_f32_e32 v50, 0x40c00000, v47
+; GCN-NEXT:    v_add_f32_e32 v51, 0x40c00000, v56
+; GCN-NEXT:    v_add_f32_e32 v52, 0x40c00000, v57
+; GCN-NEXT:    v_add_f32_e32 v53, 0x40c00000, v58
+; GCN-NEXT:    v_add_f32_e32 v54, 0x40c00000, v59
+; GCN-NEXT:    v_add_f32_e32 v55, 0x40c00000, v60
+; GCN-NEXT:    v_add_f32_e32 v40, 0x40c00000, v61
+; GCN-NEXT:    v_add_f32_e32 v41, 0x40c00000, v62
+; GCN-NEXT:    v_add_f32_e32 v42, 0x40c00000, v63
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v43, 0x40c00000, v43
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v44, 0x40c00000, v44
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v45, 0x40c00000, v45
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v46, 0x40c00000, v46
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v47, 0x40c00000, v47
+; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v56, 0x40c00000, v56
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v57, 0x40c00000, v57
+; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v58, 0x40c00000, v58
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v59, 0x40c00000, v59
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_add_f32_e32 v60, 0x40c00000, v60
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v61, 0x40c00000, v61
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v62, 0x40c00000, v62
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v63, 0x40c00000, v63
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
 ; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
 ; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
@@ -106078,441 +106130,426 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v62, 16, v62
 ; GCN-NEXT:    v_lshrrev_b32_e32 v63, 16, v63
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v63
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v62
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v61, v61
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v61
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v60
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v59
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v58
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v57
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v56
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v47
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v46
+; GCN-NEXT:    buffer_store_dword v46, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v45
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v44
+; GCN-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v43, v43
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v42, v42
+; GCN-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v41
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v40
+; GCN-NEXT:    buffer_store_dword v40, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v55
+; GCN-NEXT:    buffer_store_dword v55, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v54
+; GCN-NEXT:    buffer_store_dword v54, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
+; GCN-NEXT:    buffer_store_dword v53, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v52
+; GCN-NEXT:    buffer_store_dword v52, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v51
+; GCN-NEXT:    buffer_store_dword v51, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v50
+; GCN-NEXT:    buffer_store_dword v50, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v49
+; GCN-NEXT:    buffer_store_dword v49, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
+; GCN-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v39
+; GCN-NEXT:    buffer_store_dword v39, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
+; GCN-NEXT:    buffer_store_dword v38, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
+; GCN-NEXT:    buffer_store_dword v36, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v35
+; GCN-NEXT:    buffer_store_dword v35, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    buffer_store_dword v34, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v34, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
+; GCN-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
-; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v31
-; GCN-NEXT:    v_mov_b32_e32 v31, v61
+; GCN-NEXT:    buffer_store_dword v32, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
+; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v29
-; GCN-NEXT:    v_mov_b32_e32 v29, v36
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v29
+; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v28, v32
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
-; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v27, v38
+; GCN-NEXT:    buffer_store_dword v27, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v26, v33
+; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
-; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v25, v48
+; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v24, v35
+; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
-; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v23, v50
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v22, v37
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v21, v52
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v20, v39
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v19, v54
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v18, v49
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v17, v40
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v16, v51
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v15, v42
+; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v14, v53
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v13, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v12, v55
+; GCN-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v12
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v11, v46
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v10
 ; GCN-NEXT:    v_mov_b32_e32 v10, v41
-; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v9
 ; GCN-NEXT:    v_mov_b32_e32 v9, v56
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v8
 ; GCN-NEXT:    v_mov_b32_e32 v8, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v7
 ; GCN-NEXT:    v_mov_b32_e32 v7, v45
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v6, v2
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v5
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mov_b32_e32 v5, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v5
+; GCN-NEXT:    v_mov_b32_e32 v5, v2
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v4
+; GCN-NEXT:    v_mov_b32_e32 v4, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v3
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v1
 ; GCN-NEXT:  .LBB49_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v4, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v3, v2
 ; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 4, v0
 ; GCN-NEXT:    buffer_store_dword v2, v1, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v31
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v45, v2, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v44, v2, v1
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, 8, v0
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v47, v2, v1
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, 12, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v6
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_or_b32_e32 v46, v2, v1
 ; GCN-NEXT:    v_add_i32_e32 v57, vcc, 16, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v13
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v2, v1
+; GCN-NEXT:    v_or_b32_e32 v6, v2, v1
 ; GCN-NEXT:    v_add_i32_e32 v56, vcc, 20, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v15
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v8
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v2
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, 24, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v10
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v2
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, 28, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v12
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v2
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, 32, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v21
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v10, v10, v2
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, 36, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v2
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 40, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v18
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v2
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, 44, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v27
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v2
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 48, v0
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v29
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v2
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 52, v0
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v2
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 56, v0
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v26
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v22, v2
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 60, v0
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v13, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 24, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 64, v0
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v22, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 28, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v25, v26, v25
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 0x44, v0
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v25, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 32, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v27, v28, v27
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 0x48, v0
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v27, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 36, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
-; GCN-NEXT:    v_or_b32_e32 v29, v30, v29
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 0x4c, v0
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v31
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v29, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 40, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    v_or_b32_e32 v31, v32, v31
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x50, v0
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v33, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 44, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    v_or_b32_e32 v33, v34, v33
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 0x54, v0
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v35, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 48, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
-; GCN-NEXT:    v_or_b32_e32 v35, v36, v35
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x58, v0
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_or_b32_e32 v37, v2, v1
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 52, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v37, v38, v37
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, 0x5c, v0
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_or_b32_e32 v38, v3, v2
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, 56, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    v_or_b32_e32 v39, v48, v39
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, 0x60, v0
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v3, v2
+; GCN-NEXT:    v_add_i32_e32 v48, vcc, 60, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v49
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    v_or_b32_e32 v49, v50, v49
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, 0x64, v0
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v49, v4, v3
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, 64, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v52
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v51
-; GCN-NEXT:    v_or_b32_e32 v51, v52, v51
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x68, v0
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v51, v4, v3
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x44, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v53
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v54
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_or_b32_e32 v53, v54, v53
-; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x6c, v0
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v53, v4, v3
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x48, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v55
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v40
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    v_or_b32_e32 v55, v40, v55
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x70, v0
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v55, v4, v3
+; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x4c, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v41
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v4, v4, v3
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x50, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v42
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v41
-; GCN-NEXT:    v_or_b32_e32 v41, v42, v41
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x74, v0
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v43, v3
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x78, v0
+; GCN-NEXT:    v_or_b32_e32 v7, v5, v3
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x54, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v9, v5, v3
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x58, v0
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v5, v3
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 0x5c, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v14
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v8, v5
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, 0x60, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v15
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v8, v10, v8
+; GCN-NEXT:    v_add_i32_e32 v15, vcc, 0x64, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v17
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    v_or_b32_e32 v10, v17, v10
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 0x68, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v19
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 0x6c, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v31
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    v_or_b32_e32 v20, v31, v20
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 0x70, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v21
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 0x74, v0
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
 ; GCN-NEXT:    buffer_store_dword v45, v59, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v44, v58, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v47, v57, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v46, v56, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v1, v5, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v4, v7, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v6, v9, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v8, v11, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v10, v13, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v12, v15, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v14, v17, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v16, v19, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v18, v21, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v20, v22, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v2, v24, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v23, v26, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v6, v18, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v13, v24, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v26, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v25, v28, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v27, v30, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v29, v32, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v31, v34, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v29, v34, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v33, v36, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v35, v38, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v37, v48, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v39, v50, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v35, v1, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v37, v39, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v38, v48, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v2, v50, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v49, v52, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v51, v54, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v53, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v55, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v41, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v3, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v55, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v4, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v7, v43, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v9, v11, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v14, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v5, v15, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v8, v17, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v10, v19, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v12, v31, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v20, v21, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v16, v32, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v23, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Reload
@@ -106527,7 +106564,6 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
@@ -107826,21 +107862,19 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v86, 0x40c00000, v86
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v36, v36, v37, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v37, v38, v39, 0x7fff
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v38, 0x400000, v39
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v39, v39
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v36.l, v36.h
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v87, 0x40c00000, v87 :: v_dual_lshlrev_b32 v12, 16, v12
-; GFX11-TRUE16-NEXT:    v_dual_add_f32 v96, 0x40c00000, v96 :: v_dual_cndmask_b32 v21, v37, v38
+; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v21, v37, v38, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v37, v48, v49, 0x7fff
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v38, 0x400000, v49
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v49, v49
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v48, 0x40c00000, v22
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v96, 0x40c00000, v96
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v98, 16, v15
 ; GFX11-TRUE16-NEXT:    v_dual_cndmask_b32 v22, v37, v38 :: v_dual_add_f32 v49, 0x40c00000, v51
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v39, v50, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v38, 0x400000, v50
@@ -107850,7 +107884,7 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v37, v39, v50, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v39, v48, 16, 1
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v50, v49, 16, 1
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v98, 0x40c00000, v98
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v98, 16, v15
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v101, v14, 16, 1
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v37, v37, v38, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v39, v48, 0x7fff
@@ -107858,16 +107892,15 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v48, v48
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v101, v101, v14, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v112, 0x400000, v14
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v102, v98, 16, 1
-; GFX11-TRUE16-NEXT:    v_or_b32_e32 v114, 0x400000, v98
-; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v23, v38, v39, vcc_lo
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT:    v_dual_add_f32 v98, 0x40c00000, v98 :: v_dual_cndmask_b32 v23, v38, v39
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v38, v50, v49, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v39, 0x400000, v49
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v50, 0x40c00000, v52
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v48, v51, 16, 1
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v49, v49
-; GFX11-TRUE16-NEXT:    v_add3_u32 v102, v102, v98, 0x7fff
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v102, v98, 16, 1
+; GFX11-TRUE16-NEXT:    v_or_b32_e32 v114, 0x400000, v98
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v49, v50, 16, 1
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v38, v38, v39, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v39, v48, v51, 0x7fff
@@ -107875,7 +107908,7 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v51, v51
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v53, 0xffff0000, v25
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v52, 0x40c00000, v24 :: v_dual_lshlrev_b32 v25, 16, v25
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT:    v_add3_u32 v102, v102, v98, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v24, v39, v48, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v39, v49, v50, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v48, 0x400000, v50
@@ -108005,8 +108038,8 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v64, v65, v66, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v65, 0x400000, v66
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v66, v66
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v70, 0x40c00000, v1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v66, 0x400000, v67
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v70, 0x40c00000, v1
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v64, v64, v65, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v65, v68, v67, 0x7fff
@@ -108017,8 +108050,8 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v64.l, v64.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v1, v65, v66, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v65, v68, v69, 0x7fff
-; GFX11-TRUE16-NEXT:    v_and_b32_e32 v68, 0xffff0000, v3
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v66, 0x400000, v69
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v68, 0xffff0000, v3
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v69, v69
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v69, v2, 16, 1
@@ -108303,15 +108336,15 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v33, v34, v36, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v34, 0x400000, v36
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v37, v17, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v35, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v36, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v33, v33, v34, vcc_lo
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v39, 0x40c00000, v18
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v34, v37, v35, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v38, 16, 1
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v39, 0x40c00000, v18
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v17, v33, v17, 0x7060302
@@ -108518,14 +108551,12 @@ define <64 x half> @bitcast_v64bf16_to_v64f16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v0, 0x40c00000, v0 :: v_dual_lshlrev_b32 v67, 16, v1
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v30, v54, v30, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v31, v55, v64, vcc_lo
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v67, 0x40c00000, v67
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v55, v65, v66, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v64, 0x400000, v66
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v65, v68, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v66, v66
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v67, 0x40c00000, v67
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v66, v0, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v55, v55, v64, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v64, v65, v68, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v65, 0x400000, v68
@@ -108805,19 +108836,10 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:104
-; GCN-NEXT:    s_waitcnt expcnt(6)
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:136
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:68
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:64
 ; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:60
@@ -108828,22 +108850,38 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:40
 ; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v2
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v5
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v6
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v7
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
@@ -108865,26 +108903,25 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v34
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:136
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:112
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:116
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:120
-; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v49
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
@@ -108899,352 +108936,359 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v43
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v44
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v56
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:112
-; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v57
-; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v31
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v1
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v3
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v57
+; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v59
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v61
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v62
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v2
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v3
+; GCN-NEXT:    ; implicit-def: $vgpr5
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr31
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr9
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr7
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr6
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
+; GCN-NEXT:    ; implicit-def: $vgpr3
+; GCN-NEXT:    ; kill: killed $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
 ; GCN-NEXT:    ; implicit-def: $vgpr4
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr4
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB50_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v2
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v5
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v6
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v7
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v8
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v9
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v10
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v11
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v12
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v10
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v11
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v13
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v12
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v14
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v13
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v15
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v14
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v15
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v16
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v17
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v17
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v18
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v18
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v19
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v19
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v20
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v20
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v21
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v21
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v22
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v22
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v23
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v23
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v24
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v25
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v25
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v26
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v26
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v27
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v27
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v28
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v28
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v29
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v29
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v30
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v30
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v32
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v32
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v33
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v33
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v34
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v34
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v35
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v35
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v36
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v36
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v37
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v37
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v38
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v38
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v39
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v39
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v48
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v48
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v49
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v49
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v50
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v50
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v51
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v51
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v52
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v52
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v53
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v53
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v54
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v54
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v55
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v55
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v40
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v40
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v41
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v41
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v42
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v42
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v43
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v43
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v44
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v44
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v45
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v45
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v46
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v46
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v47
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v47
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v56
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v56
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v57
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v57
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v58
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v58
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v59
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v59
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v60
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v60
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v61
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v61
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v62
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v62
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v63
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v2
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
@@ -109253,11 +109297,6 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr7
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr9
 ; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr12
@@ -109312,16 +109351,16 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr31
+; GCN-NEXT:    ; implicit-def: $vgpr2
 ; GCN-NEXT:  .LBB50_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB50_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f32_f16_e32 v63, v63
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v63
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v62, v62
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v61, v61
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v60, v60
@@ -109362,34 +109401,44 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v21
+; GCN-NEXT:    v_cvt_f32_f16_e32 v63, v21
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v19
-; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v19
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v18
+; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v17
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v16
+; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v15
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:464 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
@@ -109400,8 +109449,8 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
 ; GCN-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v63
-; GCN-NEXT:    v_add_f32_e32 v62, 0x38000000, v62
+; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v62
 ; GCN-NEXT:    v_add_f32_e32 v61, 0x38000000, v61
 ; GCN-NEXT:    v_add_f32_e32 v60, 0x38000000, v60
 ; GCN-NEXT:    v_add_f32_e32 v59, 0x38000000, v59
@@ -109441,51 +109490,53 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v24, 0x38000000, v24
 ; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
 ; GCN-NEXT:    v_add_f32_e32 v22, 0x38000000, v22
-; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_add_f32_e32 v62, 0x38000000, v63
 ; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
-; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
-; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
-; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
-; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
-; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
+; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
+; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
+; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
-; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
-; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
-; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
-; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
+; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
+; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
+; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
 ; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
+; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
 ; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v2
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v7
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v21
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v15
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v11
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v12
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v3
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v9
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v62
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v24
@@ -109525,7 +109576,7 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v59
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v61
-; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v62
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
@@ -109534,489 +109585,470 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v19
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v6
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v8
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v10
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v11
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
+; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
+; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v12
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v13
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v13
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v14
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v14
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v15
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
+; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
+; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v16
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v7, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
+; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v17
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mov_b32_e32 v8, v21
+; GCN-NEXT:    v_mov_b32_e32 v9, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v6
+; GCN-NEXT:    v_mov_b32_e32 v6, v11
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v18
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v3
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v4, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v7
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v62
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v22
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v23
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v24
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v25
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v26
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v27
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v28
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v29
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v30
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v32
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v33
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v34
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v35
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v36
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v37
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v38
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v39
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v48
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v49
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v50
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v51
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v52
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v53
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v54
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v55
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v40
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v41
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v42
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v43
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v44
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v45
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v46
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v47
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v56
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v57
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v58
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v59
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v60
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v61
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v62
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v31
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v4
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v31
+; GCN-NEXT:    v_mov_b32_e32 v31, v19
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v63
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v1
 ; GCN-NEXT:  .LBB50_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v3
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v5
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_alignbit_b32 v1, v1, v2, 16
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v8
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v2, v2, v3, 16
+; GCN-NEXT:    v_alignbit_b32 v2, v2, v5, 16
 ; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 4, v0
 ; GCN-NEXT:    buffer_store_dword v2, v1, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v31
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_alignbit_b32 v45, v1, v2, 16
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_alignbit_b32 v44, v1, v2, 16
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, 8, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v4
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v9
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_alignbit_b32 v47, v1, v2, 16
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, 12, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v7
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_alignbit_b32 v46, v1, v2, 16
 ; GCN-NEXT:    v_add_i32_e32 v57, vcc, 16, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v2, v1, v2, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v1, v2, 16
 ; GCN-NEXT:    v_add_i32_e32 v56, vcc, 20, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v3, v1, v3, 16
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, 24, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v5
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v5, v1, v5, 16
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, 28, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v6
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v5, v2, v5, 16
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 24, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v7, 1.0, v7
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v7, v1, v7, 16
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 32, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v7, v2, v7, 16
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 28, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v9, 1.0, v9
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v9, v1, v9, 16
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, 36, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v9, v2, v9, 16
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, 32, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v11, v1, v11, 16
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, 40, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v11, v2, v11, 16
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, 36, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v13
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v13, v1, v13, 16
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, 44, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v13, v2, v13, 16
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, 40, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v15, 1.0, v15
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v15, v1, v15, 16
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 48, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v15, v2, v15, 16
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 44, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v17
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v17, v1, v17, 16
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 52, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v17, v2, v17, 16
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 48, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v19, 1.0, v19
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v19, v1, v19, 16
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 56, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v19, v2, v19, 16
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 52, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v21, 1.0, v21
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v21, v1, v21, 16
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 60, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v21, v2, v21, 16
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 56, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v23, 1.0, v23
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v23, v1, v23, 16
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 64, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v23, v2, v23, 16
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 60, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v25, 1.0, v25
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v25, v1, v25, 16
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 0x44, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v25, v2, v25, 16
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 64, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v27, 1.0, v27
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v27, v1, v27, 16
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 0x48, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v27, v2, v27, 16
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 0x44, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v29, 1.0, v29
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v29, v1, v29, 16
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 0x4c, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v29, v2, v29, 16
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 0x48, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v31, 1.0, v31
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v31, v1, v31, 16
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x50, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v31, v2, v31, 16
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x4c, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v33
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v33, v1, v33, 16
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 0x54, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v33, v2, v33, 16
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 0x50, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v35
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v35, v1, v35, 16
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x58, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v35, v2, v35, 16
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x54, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v37, v1, v37, 16
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, 0x5c, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v37, v2, v37, 16
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, 0x58, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v39, 1.0, v39
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v39, v1, v39, 16
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, 0x60, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v39, v2, v39, 16
+; GCN-NEXT:    v_add_i32_e32 v48, vcc, 0x5c, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v49, 1.0, v49
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v49, v1, v49, 16
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, 0x64, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v49, v2, v49, 16
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, 0x60, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v51, 1.0, v51
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v51, v1, v51, 16
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x68, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v51, v2, v51, 16
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x64, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v53
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v53, v1, v53, 16
-; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x6c, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v53, v2, v53, 16
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x68, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v55
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v55, v1, v55, 16
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x70, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v55, v2, v55, 16
+; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x6c, v0
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v41
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v41, v1, v41, 16
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x74, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v2, v2, v41, 16
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x70, v0
+; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v42, 1.0, v42
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_alignbit_b32 v3, v3, v42, 16
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x74, v0
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v43
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v1, v1, v43, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v4, v4, v43, 16
 ; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
 ; GCN-NEXT:    buffer_store_dword v45, v59, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v44, v58, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v47, v57, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v46, v56, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v2, v4, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v3, v6, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v1, v6, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v5, v8, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v7, v10, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v9, v12, s[0:3], 0 offen
@@ -110038,9 +110070,10 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v49, v52, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v51, v54, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v53, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v55, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v41, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v55, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v2, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v43, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Reload
@@ -110055,7 +110088,6 @@ define <64 x bfloat> @bitcast_v64f16_to_v64bf16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
@@ -110315,751 +110347,788 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:44
-; GCN-NEXT:    s_waitcnt expcnt(6)
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:40
-; GCN-NEXT:    s_waitcnt expcnt(5)
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:36
-; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:32
-; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v1
-; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v2
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:8
+; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v4
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v5
-; GCN-NEXT:    v_mul_f32_e32 v33, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v31, 1.0, v5
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v6
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v8
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v32, 1.0, v9
-; GCN-NEXT:    v_mul_f32_e32 v31, 1.0, v10
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v9
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v10
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v10, 1.0, v13
-; GCN-NEXT:    v_mul_f32_e32 v9, 1.0, v14
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v13
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v14
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v8, 1.0, v17
-; GCN-NEXT:    v_mul_f32_e32 v6, 1.0, v18
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v18
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v21
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v22
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v21
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v22
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v25
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v26
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v27
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:412 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v28
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:404 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v29
-; GCN-NEXT:    v_mul_f32_e32 v61, 1.0, v30
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v40
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v27
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
-; GCN-NEXT:    v_mul_f32_e32 v15, 1.0, v55
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:428 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v28
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v15, 1.0, v38
-; GCN-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:424 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(4)
-; GCN-NEXT:    v_mul_f32_e32 v41, 1.0, v14
-; GCN-NEXT:    v_mul_f32_e32 v25, 1.0, v13
-; GCN-NEXT:    v_mul_f32_e32 v12, 1.0, v12
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:436 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v11
-; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:432 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v7
-; GCN-NEXT:    v_mul_f32_e32 v24, 1.0, v5
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v59
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v29
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v58
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:440 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v23, 1.0, v57
-; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v56
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v30
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v47
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:136
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v45
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v44
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v46
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v21, 1.0, v45
-; GCN-NEXT:    v_mul_f32_e32 v20, 1.0, v44
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v43
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v43
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v42
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v20, 1.0, v41
+; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v40
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v42
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:456 ; 4-byte Folded Spill
-; GCN-NEXT:    v_mul_f32_e32 v19, 1.0, v54
-; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v37
-; GCN-NEXT:    v_mul_f32_e32 v57, 1.0, v53
-; GCN-NEXT:    v_mul_f32_e32 v43, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v18, 1.0, v51
-; GCN-NEXT:    v_mul_f32_e32 v15, 1.0, v50
-; GCN-NEXT:    v_mul_f32_e32 v59, 1.0, v49
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v55
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:112
-; GCN-NEXT:    v_mul_f32_e32 v60, 1.0, v48
-; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v39
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v54
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Spill
+; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v53
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v52
+; GCN-NEXT:    v_mul_f32_e32 v8, 1.0, v51
+; GCN-NEXT:    v_mul_f32_e32 v7, 1.0, v50
+; GCN-NEXT:    v_mul_f32_e32 v63, 1.0, v49
+; GCN-NEXT:    v_mul_f32_e32 v62, 1.0, v48
+; GCN-NEXT:    v_mul_f32_e32 v10, 1.0, v39
+; GCN-NEXT:    v_mul_f32_e32 v9, 1.0, v38
+; GCN-NEXT:    v_mul_f32_e32 v38, 1.0, v37
+; GCN-NEXT:    v_mul_f32_e32 v37, 1.0, v36
+; GCN-NEXT:    v_mul_f32_e32 v12, 1.0, v35
+; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v34
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_mul_f32_e32 v18, 1.0, v1
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_mul_f32_e32 v17, 1.0, v13
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v5
+; GCN-NEXT:    v_mul_f32_e32 v14, 1.0, v14
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v7, 1.0, v7
-; GCN-NEXT:    v_mul_f32_e32 v5, 1.0, v16
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:132
-; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v26
+; GCN-NEXT:    v_mul_f32_e32 v13, 1.0, v19
+; GCN-NEXT:    v_mul_f32_e32 v35, 1.0, v6
+; GCN-NEXT:    v_mul_f32_e32 v19, 1.0, v15
+; GCN-NEXT:    v_mul_f32_e32 v16, 1.0, v16
+; GCN-NEXT:    v_mul_f32_e32 v15, 1.0, v21
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:132
+; GCN-NEXT:    v_mul_f32_e32 v36, 1.0, v23
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_mul_f32_e32 v14, 1.0, v11
+; GCN-NEXT:    v_mul_f32_e32 v34, 1.0, v1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_mul_f32_e32 v12, 1.0, v12
+; GCN-NEXT:    v_mul_f32_e32 v6, 1.0, v6
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v11, 1.0, v27
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    v_mul_f32_e32 v21, 1.0, v21
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr61
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr48
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr50
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr49
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr48
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr39
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr28
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr45
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr28
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr27
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr24
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; kill: killed $vgpr23
 ; GCN-NEXT:    ; implicit-def: $vgpr53
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr49
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; kill: killed $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr39
-; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; kill: killed $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr26
+; GCN-NEXT:    ; implicit-def: $vgpr23
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB51_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v36
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v33
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v35
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v32
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v62, 16, v26
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v52, 16, v26
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v34
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v42, 16, v23
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v31
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v33
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v61, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v58, 16, v26
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v54, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v48, 16, v26
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v32
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v31
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v56, 16, v26
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v60, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v26
-; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v52, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v47, 16, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v9
-; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v46, 16, v6
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v59, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v6
-; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v51, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v45, 16, v3
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v44, 16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v58, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v63
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v50, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v61
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v42, 16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v57, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v41
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v49, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v40, 16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v56, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v1
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v38
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v48, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v55, 16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v47, 16, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v23
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v23
+; GCN-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v54, 16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v46, 16, v4
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v20
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v53, 16, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshrrev_b32_e32 v45, 16, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v37
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v51, 16, v57
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v3, 16, v22
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v44, 16, v8
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v7
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v43
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v63
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v18
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v62
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v43, 16, v10
+; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v50, 16, v59
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v38
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v60
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v37
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v41, 16, v12
+; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v11
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v18
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v49, 16, v7
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v17
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v40, 16, v14
+; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v35
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v19
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v55, 16, v16
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v15
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v14
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v12
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v36
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    ; implicit-def: $vgpr36
-; GCN-NEXT:    ; implicit-def: $vgpr35
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v34
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v53, 16, v6
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v21
 ; GCN-NEXT:    ; implicit-def: $vgpr33
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr32
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
 ; GCN-NEXT:    ; implicit-def: $vgpr31
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr10
-; GCN-NEXT:    ; implicit-def: $vgpr9
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr8
-; GCN-NEXT:    ; implicit-def: $vgpr6
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr5
 ; GCN-NEXT:    ; implicit-def: $vgpr4
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr20
 ; GCN-NEXT:    ; implicit-def: $vgpr3
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; kill: killed $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr2
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; kill: killed $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr22
+; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr7
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr61
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr41
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
+; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr9
 ; GCN-NEXT:    ; implicit-def: $vgpr38
-; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr23
-; GCN-NEXT:    ; implicit-def: $vgpr22
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr20
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr5
-; GCN-NEXT:    ; kill: killed $vgpr5
-; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr37
-; GCN-NEXT:    ; implicit-def: $vgpr57
-; GCN-NEXT:    ; implicit-def: $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr12
+; GCN-NEXT:    ; implicit-def: $vgpr11
 ; GCN-NEXT:    ; implicit-def: $vgpr18
-; GCN-NEXT:    ; implicit-def: $vgpr15
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr14
 ; GCN-NEXT:    ; implicit-def: $vgpr13
-; GCN-NEXT:    ; implicit-def: $vgpr7
-; GCN-NEXT:    ; implicit-def: $vgpr5
+; GCN-NEXT:    ; implicit-def: $vgpr35
+; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr12
-; GCN-NEXT:    ; implicit-def: $vgpr11
+; GCN-NEXT:    ; implicit-def: $vgpr15
+; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr6
+; GCN-NEXT:    ; implicit-def: $vgpr21
 ; GCN-NEXT:  .LBB51_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB51_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v36
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v35
-; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v26
-; GCN-NEXT:    v_add_f32_e32 v35, 0x40c00000, v27
-; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v35
-; GCN-NEXT:    v_alignbit_b32 v26, v27, v26, 16
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v34
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v33
-; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v26
-; GCN-NEXT:    v_add_f32_e32 v33, 0x40c00000, v27
-; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v33
-; GCN-NEXT:    v_alignbit_b32 v26, v27, v26, 16
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v33
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v32
+; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v1
+; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v23
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v1
+; GCN-NEXT:    v_alignbit_b32 v23, v23, v24, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v32
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v31
-; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v26
-; GCN-NEXT:    v_add_f32_e32 v31, 0x40c00000, v27
-; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v31
-; GCN-NEXT:    v_alignbit_b32 v26, v27, v26, 16
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
-; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v31
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v31, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v31
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v9
-; GCN-NEXT:    v_alignbit_b32 v10, v26, v10, 16
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
-; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v32, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v32
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v33, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v33
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v51, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v51
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v50, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v50
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v49, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v49
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v6
-; GCN-NEXT:    v_alignbit_b32 v8, v10, v8, 16
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v48, 0x40c00000, v24
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v48
+; GCN-NEXT:    v_alignbit_b32 v23, v24, v23, 16
+; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v4
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v4
+; GCN-NEXT:    v_alignbit_b32 v5, v23, v5, 16
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v20
+; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v3
+; GCN-NEXT:    v_alignbit_b32 v5, v20, v5, 16
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v3
-; GCN-NEXT:    v_alignbit_b32 v4, v8, v4, 16
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v22
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v2
-; GCN-NEXT:    v_add_f32_e32 v1, 0x40c00000, v1
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v2, v4, v2, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v63
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v61
-; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v2
-; GCN-NEXT:    v_add_f32_e32 v2, 0x40c00000, v4
-; GCN-NEXT:    v_lshrrev_b32_e32 v4, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v4, v4, v8, 16
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v41
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v25
-; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v4
-; GCN-NEXT:    v_add_f32_e32 v4, 0x40c00000, v8
-; GCN-NEXT:    v_lshrrev_b32_e32 v8, 16, v4
-; GCN-NEXT:    v_alignbit_b32 v8, v8, v10, 16
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v38
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v24
-; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v8
-; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v10
-; GCN-NEXT:    v_lshrrev_b32_e32 v10, 16, v8
-; GCN-NEXT:    v_alignbit_b32 v10, v10, v24, 16
-; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v23
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v10
-; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v22
-; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v10
-; GCN-NEXT:    v_alignbit_b32 v22, v22, v23, 16
-; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v20
-; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v21
-; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v20
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v2
+; GCN-NEXT:    v_alignbit_b32 v5, v20, v5, 16
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v63
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v62
+; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v5
+; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v20
+; GCN-NEXT:    v_lshrrev_b32_e32 v20, 16, v5
+; GCN-NEXT:    v_alignbit_b32 v20, v20, v22, 16
+; GCN-NEXT:    buffer_store_dword v20, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v38
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v37
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v20
+; GCN-NEXT:    v_add_f32_e32 v20, 0x40c00000, v22
 ; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v20
-; GCN-NEXT:    v_alignbit_b32 v21, v22, v21, 16
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v37
-; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v19
-; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v21
-; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v19
-; GCN-NEXT:    v_alignbit_b32 v21, v21, v22, 16
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v22, v22, v23, 16
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
-; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
+; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v15
-; GCN-NEXT:    v_alignbit_b32 v18, v21, v18, 16
+; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v17
+; GCN-NEXT:    v_alignbit_b32 v18, v22, v18, 16
 ; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
-; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v18, 16, v13
-; GCN-NEXT:    v_alignbit_b32 v17, v18, v17, 16
-; GCN-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v35
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
+; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v18
+; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v19
+; GCN-NEXT:    v_lshrrev_b32_e32 v19, 16, v18
+; GCN-NEXT:    v_alignbit_b32 v19, v19, v22, 16
+; GCN-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v36
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v34
+; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v19
+; GCN-NEXT:    v_add_f32_e32 v19, 0x40c00000, v22
+; GCN-NEXT:    v_lshrrev_b32_e32 v22, 16, v19
+; GCN-NEXT:    v_alignbit_b32 v22, v22, v23, 16
+; GCN-NEXT:    buffer_store_dword v22, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
-; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshrrev_b32_e32 v17, 16, v14
-; GCN-NEXT:    v_alignbit_b32 v16, v17, v16, 16
-; GCN-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
+; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff0000, v59
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v60
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v57
-; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v43
-; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:456 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:452 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:448 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:444 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:440 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:436 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:432 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:428 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v30
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:424 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v32
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:412 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v34, 0xffff0000, v34
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:404 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v35, 0xffff0000, v35
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v36, 0xffff0000, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xffff0000, v37
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v38
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v48
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v49
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v50
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v51
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v52
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v53
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v54, 0xffff0000, v54
 ; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
@@ -111068,392 +111137,362 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v40, 0xffff0000, v40
+; GCN-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
+; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v21
+; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
+; GCN-NEXT:    v_add_f32_e32 v15, 0x40c00000, v15
+; GCN-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
+; GCN-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
+; GCN-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
+; GCN-NEXT:    v_add_f32_e32 v9, 0x40c00000, v9
+; GCN-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x40c00000, v7
-; GCN-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
-; GCN-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
-; GCN-NEXT:    v_add_f32_e32 v17, 0x40c00000, v17
-; GCN-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
-; GCN-NEXT:    v_add_f32_e32 v21, 0x40c00000, v21
 ; GCN-NEXT:    v_add_f32_e32 v22, 0x40c00000, v22
-; GCN-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
-; GCN-NEXT:    v_add_f32_e32 v24, 0x40c00000, v24
-; GCN-NEXT:    v_add_f32_e32 v25, 0x40c00000, v25
-; GCN-NEXT:    v_add_f32_e32 v41, 0x40c00000, v26
-; GCN-NEXT:    v_add_f32_e32 v26, 0x40c00000, v27
-; GCN-NEXT:    v_add_f32_e32 v42, 0x40c00000, v28
-; GCN-NEXT:    v_add_f32_e32 v27, 0x40c00000, v29
-; GCN-NEXT:    v_add_f32_e32 v43, 0x40c00000, v30
-; GCN-NEXT:    v_add_f32_e32 v28, 0x40c00000, v32
-; GCN-NEXT:    v_add_f32_e32 v32, 0x40c00000, v34
-; GCN-NEXT:    v_add_f32_e32 v29, 0x40c00000, v36
-; GCN-NEXT:    v_add_f32_e32 v34, 0x40c00000, v37
-; GCN-NEXT:    v_add_f32_e32 v30, 0x40c00000, v38
-; GCN-NEXT:    v_add_f32_e32 v36, 0x40c00000, v39
-; GCN-NEXT:    v_add_f32_e32 v37, 0x40c00000, v48
-; GCN-NEXT:    v_add_f32_e32 v38, 0x40c00000, v49
-; GCN-NEXT:    v_add_f32_e32 v39, 0x40c00000, v50
-; GCN-NEXT:    v_add_f32_e32 v56, 0x40c00000, v51
-; GCN-NEXT:    v_add_f32_e32 v48, 0x40c00000, v52
-; GCN-NEXT:    v_add_f32_e32 v57, 0x40c00000, v53
-; GCN-NEXT:    v_add_f32_e32 v49, 0x40c00000, v54
-; GCN-NEXT:    v_add_f32_e32 v59, 0x40c00000, v55
-; GCN-NEXT:    v_add_f32_e32 v50, 0x40c00000, v40
-; GCN-NEXT:    v_lshrrev_b32_e32 v51, 16, v11
-; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v14
-; GCN-NEXT:    v_lshrrev_b32_e32 v53, 16, v5
-; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v13
-; GCN-NEXT:    v_lshrrev_b32_e32 v54, 16, v17
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v15
-; GCN-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff0000, v19
-; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v20
-; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GCN-NEXT:    v_lshrrev_b32_e32 v40, 16, v27
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GCN-NEXT:    v_lshrrev_b32_e32 v44, 16, v28
+; GCN-NEXT:    v_add_f32_e32 v41, 0x40c00000, v23
+; GCN-NEXT:    v_add_f32_e32 v46, 0x40c00000, v24
+; GCN-NEXT:    v_add_f32_e32 v42, 0x40c00000, v25
+; GCN-NEXT:    v_add_f32_e32 v47, 0x40c00000, v26
+; GCN-NEXT:    v_add_f32_e32 v43, 0x40c00000, v27
+; GCN-NEXT:    v_add_f32_e32 v56, 0x40c00000, v28
+; GCN-NEXT:    v_add_f32_e32 v44, 0x40c00000, v29
+; GCN-NEXT:    v_add_f32_e32 v57, 0x40c00000, v30
+; GCN-NEXT:    v_add_f32_e32 v34, 0x40c00000, v34
+; GCN-NEXT:    v_add_f32_e32 v35, 0x40c00000, v35
+; GCN-NEXT:    v_add_f32_e32 v36, 0x40c00000, v36
+; GCN-NEXT:    v_add_f32_e32 v37, 0x40c00000, v37
+; GCN-NEXT:    v_add_f32_e32 v38, 0x40c00000, v38
+; GCN-NEXT:    v_add_f32_e32 v60, 0x40c00000, v39
+; GCN-NEXT:    v_add_f32_e32 v52, 0x40c00000, v52
+; GCN-NEXT:    v_add_f32_e32 v61, 0x40c00000, v53
+; GCN-NEXT:    v_add_f32_e32 v53, 0x40c00000, v54
+; GCN-NEXT:    v_add_f32_e32 v62, 0x40c00000, v55
+; GCN-NEXT:    v_add_f32_e32 v55, 0x40c00000, v40
+; GCN-NEXT:    v_lshrrev_b32_e32 v23, 16, v21
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
+; GCN-NEXT:    v_lshrrev_b32_e32 v24, 16, v15
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff0000, v18
+; GCN-NEXT:    v_lshrrev_b32_e32 v25, 16, v13
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff0000, v17
+; GCN-NEXT:    v_lshrrev_b32_e32 v26, 16, v11
+; GCN-NEXT:    v_and_b32_e32 v11, 0xffff0000, v20
+; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v9
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
+; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v7
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
-; GCN-NEXT:    v_lshrrev_b32_e32 v45, 16, v29
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GCN-NEXT:    v_lshrrev_b32_e32 v27, 16, v30
+; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v41
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v28, 16, v37
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GCN-NEXT:    v_lshrrev_b32_e32 v29, 16, v39
-; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v48
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v31
-; GCN-NEXT:    v_lshrrev_b32_e32 v48, 16, v49
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff0000, v33
-; GCN-NEXT:    v_lshrrev_b32_e32 v52, 16, v50
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v35
-; GCN-NEXT:    buffer_store_dword v51, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v39, v51, v12, 16
-; GCN-NEXT:    buffer_store_dword v53, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v49, v53, v7, 16
-; GCN-NEXT:    buffer_store_dword v54, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v50, v54, v16, 16
-; GCN-NEXT:    buffer_store_dword v21, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    v_alignbit_b32 v51, v21, v18, 16
-; GCN-NEXT:    buffer_store_dword v23, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    v_alignbit_b32 v53, v23, v22, 16
-; GCN-NEXT:    buffer_store_dword v25, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    v_alignbit_b32 v54, v25, v24, 16
-; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v55, v26, v41, 16
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v26, v40
-; GCN-NEXT:    v_alignbit_b32 v40, v26, v42, 16
-; GCN-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v42, v44, v43, 16
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_alignbit_b32 v44, v45, v32, 16
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v45, v27, v34, 16
-; GCN-NEXT:    v_alignbit_b32 v46, v28, v36, 16
-; GCN-NEXT:    v_alignbit_b32 v47, v29, v38, 16
-; GCN-NEXT:    v_alignbit_b32 v56, v30, v56, 16
-; GCN-NEXT:    v_alignbit_b32 v58, v48, v57, 16
-; GCN-NEXT:    v_alignbit_b32 v62, v52, v59, 16
-; GCN-NEXT:    v_alignbit_b32 v7, v62, v20, 16
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v7, v58, v19, 16
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v7, v56, v17, 16
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v7, v47, v9, 16
-; GCN-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v6, v46, v6, 16
-; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v3, v45, v3, 16
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v1, v44, v1, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshrrev_b32_e32 v30, 16, v42
+; GCN-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
+; GCN-NEXT:    v_lshrrev_b32_e32 v39, 16, v43
+; GCN-NEXT:    v_and_b32_e32 v7, 0xffff0000, v48
+; GCN-NEXT:    v_lshrrev_b32_e32 v48, 16, v44
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff0000, v49
+; GCN-NEXT:    v_lshrrev_b32_e32 v49, 16, v34
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff0000, v50
+; GCN-NEXT:    v_lshrrev_b32_e32 v50, 16, v36
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v51
+; GCN-NEXT:    v_lshrrev_b32_e32 v51, 16, v38
+; GCN-NEXT:    v_and_b32_e32 v20, 0xffff0000, v33
+; GCN-NEXT:    v_lshrrev_b32_e32 v52, 16, v52
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff0000, v32
+; GCN-NEXT:    v_lshrrev_b32_e32 v54, 16, v53
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff0000, v31
+; GCN-NEXT:    v_lshrrev_b32_e32 v42, 16, v55
+; GCN-NEXT:    v_and_b32_e32 v32, 0xffff0000, v1
+; GCN-NEXT:    v_alignbit_b32 v53, v23, v6, 16
+; GCN-NEXT:    v_alignbit_b32 v55, v24, v16, 16
+; GCN-NEXT:    v_alignbit_b32 v40, v25, v14, 16
+; GCN-NEXT:    v_alignbit_b32 v41, v26, v12, 16
+; GCN-NEXT:    v_alignbit_b32 v43, v27, v10, 16
+; GCN-NEXT:    v_alignbit_b32 v44, v28, v8, 16
+; GCN-NEXT:    v_alignbit_b32 v45, v29, v22, 16
+; GCN-NEXT:    v_alignbit_b32 v46, v30, v46, 16
+; GCN-NEXT:    v_alignbit_b32 v47, v39, v47, 16
+; GCN-NEXT:    v_alignbit_b32 v56, v48, v56, 16
+; GCN-NEXT:    v_alignbit_b32 v57, v49, v57, 16
+; GCN-NEXT:    v_alignbit_b32 v58, v50, v35, 16
+; GCN-NEXT:    v_alignbit_b32 v59, v51, v37, 16
+; GCN-NEXT:    v_alignbit_b32 v60, v52, v60, 16
+; GCN-NEXT:    v_alignbit_b32 v61, v54, v61, 16
+; GCN-NEXT:    v_alignbit_b32 v1, v42, v62, 16
+; GCN-NEXT:    v_alignbit_b32 v6, v1, v32, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:344 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v42, v2, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v61, v31, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:336 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v40, v4, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v60, v21, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v55, v8, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:388 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v59, v20, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v54, v10, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:392 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v58, v18, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v53, v15, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:396 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v57, v17, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:340 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v51, v14, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:400 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v56, v9, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:348 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v50, v13, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:408 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v6, v47, v7, 16
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:352 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v4, v46, v4, 16
+; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:356 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v3, v45, v3, 16
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:360 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v2, v44, v2, 16
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v49, v5, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:416 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v2, v43, v5, 16
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:368 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v1, v39, v11, 16
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:420 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v2, v41, v11, 16
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v2, v40, v13, 16
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:376 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v2, v55, v15, 16
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:380 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v2, v53, v19, 16
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:384 ; 4-byte Folded Spill
 ; GCN-NEXT:  .LBB51_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:344 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v62
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
-; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v42
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
+; GCN-NEXT:    buffer_store_dword v2, v0, s[0:3], 0 offen
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 4, v0
+; GCN-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 4, v0
-; GCN-NEXT:    buffer_store_dword v2, v1, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:336 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v59, v1, v2
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v58
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v57, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v63, vcc, 8, v0
+; GCN-NEXT:    v_or_b32_e32 v62, v1, v2
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v61
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v54
+; GCN-NEXT:    v_or_b32_e32 v61, v1, v2
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, 8, v0
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v58, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v62, vcc, 12, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v56, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v61, vcc, 16, v0
+; GCN-NEXT:    v_or_b32_e32 v63, v1, v2
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 12, v0
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v60
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v52
+; GCN-NEXT:    v_or_b32_e32 v60, v1, v2
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 16, v0
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:328 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v60, vcc, 20, v0
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v29
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 20, v0
+; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v59
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v51
+; GCN-NEXT:    v_or_b32_e32 v59, v3, v4
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 24, v0
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:332 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 28, v0
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v46
-; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v28
-; GCN-NEXT:    v_or_b32_e32 v6, v6, v7
+; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v50
+; GCN-NEXT:    v_or_b32_e32 v58, v7, v8
 ; GCN-NEXT:    v_add_i32_e32 v7, vcc, 32, v0
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:340 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v9
 ; GCN-NEXT:    v_add_i32_e32 v9, vcc, 36, v0
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v45
-; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v10, v10, v11
+; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v57
+; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v49
+; GCN-NEXT:    v_or_b32_e32 v57, v11, v12
 ; GCN-NEXT:    v_add_i32_e32 v11, vcc, 40, v0
 ; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:348 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v13
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 44, v0
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v44
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v48
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 48, v0
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:352 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 52, v0
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v42
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v47
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v39
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v19
 ; GCN-NEXT:    v_add_i32_e32 v19, vcc, 56, v0
 ; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:356 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v21
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 60, v0
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v40
-; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v23
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 64, v0
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:388 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v24, v24, v25
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 0x44, v0
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v55
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v27
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x48, v0
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v46
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v30
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 64, v0
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:392 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:360 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v32
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 0x44, v0
+; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v45
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
-; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 0x4c, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v54
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 0x50, v0
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v32, 0xffff, v32
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:396 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v29, v33, v29
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 0x48, v0
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
-; GCN-NEXT:    v_or_b32_e32 v32, v32, v33
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 0x54, v0
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff, v53
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff, v34
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:364 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v35
 ; GCN-NEXT:    v_or_b32_e32 v34, v34, v35
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, 0x58, v0
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    v_add_i32_e32 v35, vcc, 0x4c, v0
+; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v28, v36, v28
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 0x50, v0
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v36, 0xffff, v36
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:400 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:368 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v36, v36, v37
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, 0x5c, v0
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff, v51
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
+; GCN-NEXT:    v_or_b32_e32 v37, v37, v38
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, 0x54, v0
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff, v43
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
+; GCN-NEXT:    v_or_b32_e32 v27, v39, v27
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x58, v0
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v38, v38, v48
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, 0x60, v0
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff, v48
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:372 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
+; GCN-NEXT:    v_or_b32_e32 v48, v48, v49
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x5c, v0
+; GCN-NEXT:    v_and_b32_e32 v50, 0xffff, v41
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_or_b32_e32 v26, v50, v26
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, 0x60, v0
 ; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v51, 0xffff, v51
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:408 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:376 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v52
 ; GCN-NEXT:    v_or_b32_e32 v51, v51, v52
 ; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x64, v0
-; GCN-NEXT:    v_and_b32_e32 v50, 0xffff, v50
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_or_b32_e32 v50, v50, v53
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x68, v0
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v54, 0xffff, v54
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:416 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    v_or_b32_e32 v54, v54, v55
-; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x6c, v0
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff, v49
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v54, 0xffff, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
+; GCN-NEXT:    v_or_b32_e32 v25, v54, v25
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x68, v0
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    v_or_b32_e32 v49, v49, v40
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x70, v0
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v40, 0xffff, v40
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:380 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v41, 0xffff, v41
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:420 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v41
+; GCN-NEXT:    v_or_b32_e32 v40, v40, v41
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x6c, v0
+; GCN-NEXT:    v_and_b32_e32 v55, 0xffff, v55
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
+; GCN-NEXT:    v_or_b32_e32 v24, v55, v24
+; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x70, v0
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
-; GCN-NEXT:    v_or_b32_e32 v41, v41, v42
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x74, v0
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff, v39
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v42, 0xffff, v42
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:384 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    v_or_b32_e32 v39, v39, v43
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x78, v0
+; GCN-NEXT:    v_or_b32_e32 v42, v42, v43
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x74, v0
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff, v53
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v53, v23
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
-; GCN-NEXT:    buffer_store_dword v59, v63, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v57, v62, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v58, v61, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v56, v60, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v62, v10, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v61, v6, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v63, v2, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v60, v1, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v1, v3, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v2, v5, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v59, v5, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v4, v7, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v6, v9, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v58, v9, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v8, v11, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v10, v13, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v57, v13, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v12, v15, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v14, v17, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v16, v19, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v18, v21, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v20, v23, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v22, v25, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v24, v27, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v26, v29, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v28, v31, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v30, v33, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v32, v35, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v34, v37, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v36, v48, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v38, v52, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v51, v53, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v50, v55, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v54, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v49, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v41, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v39, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v20, v30, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v32, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v31, v33, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v29, v35, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v34, v36, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v38, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v37, v39, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v27, v49, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v48, v50, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v26, v52, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v51, v54, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v25, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v40, v55, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v24, v43, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v42, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v23, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Reload
@@ -111467,8 +111506,8 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:180 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
@@ -112986,11 +113025,10 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v70, v70, v3, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v69, v69, v67, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v80, 0x400000, v67
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v4, 0x40c00000, v4 :: v_dual_cndmask_b32 v3, v70, v71
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v71, 16, v5
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v67, v67
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v66, 0x40c00000, v66
+; GFX11-TRUE16-NEXT:    v_dual_add_f32 v66, 0x40c00000, v66 :: v_dual_lshlrev_b32 v71, 16, v5
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v3.l, v3.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v67, v69, v80, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v69, v4, 16, 1
@@ -113021,15 +113059,14 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 16, v66
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v5, v70, v81, vcc_lo
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v81, 16, v7
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v71, v71
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v81, 16, v7
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v71, v6, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v80, 0x40c00000, v80
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v5.l, v5.h
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v4, v4, 16, v66
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v69, v69, v82, vcc_lo
-; GFX11-TRUE16-NEXT:    v_add3_u32 v71, v71, v6, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v82, 0x400000, v6
+; GFX11-TRUE16-NEXT:    v_add3_u32 v71, v71, v6, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v70, v80, 16, 1
@@ -113055,20 +113092,19 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v70, 16, v70
 ; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v5, v5, 16, v69
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v7, v80, v83, vcc_lo
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v83, 16, v9
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v81, v81
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v83, 16, v9
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v81, v8, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v82, 0x40c00000, v82
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 16, v67
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v67, 16, v68
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v4, v4, 16, v66
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v71, v71, v84, vcc_lo
-; GFX11-TRUE16-NEXT:    v_add3_u32 v81, v81, v8, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v84, 0x400000, v8
+; GFX11-TRUE16-NEXT:    v_add3_u32 v81, v81, v8, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v80, v82, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v85, 0x400000, v82
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v68, 16, v2
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 16, v67
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_dual_cndmask_b32 v8, v81, v84 :: v_dual_add_f32 v9, 0x40c00000, v9
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v84, 16, v10
@@ -113077,7 +113113,7 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v83, 0x40c00000, v83
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v82, v9, 16, 1
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v69, 16, v1
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v67, 16, v68
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v80, v80, v85, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v81, v83, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v82, v82, v9, 0x7fff
@@ -113086,8 +113122,8 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v81, v81, v83, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v86, 0x400000, v83
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v6, v6, 16, v70
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v70, 16, v0
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v68, 16, v2
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v69, 16, v1
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v9, v82, v85, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v83, v83
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v84, 0x40c00000, v84
@@ -113102,49 +113138,48 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v87, 0x400000, v84
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v82, v82, v84, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v11, 0x40c00000, v11
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v2, v65, 16, v67
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v6, v6, 16, v70
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v10, v83, v86, vcc_lo
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v86, 16, v12
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v84, v84
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v85, 0x40c00000, v85
+; GFX11-TRUE16-NEXT:    v_dual_add_f32 v85, 0x40c00000, v85 :: v_dual_lshlrev_b32 v86, 16, v12
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v84, v11, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v96, 0x400000, v11
-; GFX11-TRUE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v82, v82, v87, vcc_lo
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v83, v85, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v87, 0x400000, v85
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v85, v85
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v84, v84, v11, 0x7fff
-; GFX11-TRUE16-NEXT:    v_add_f32_e32 v12, 0x40c00000, v12
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v83, v83, v85, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v85, 16, v13
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v1, v64, 16, v68
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v0, v55, 16, v69
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v55, 16, v30
-; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v83, v83, v87, vcc_lo
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v70, 16, v0
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v2, v65, 16, v67
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT:    v_dual_add_f32 v12, 0x40c00000, v12 :: v_dual_cndmask_b32 v83, v83, v87
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v11, v11
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v86, 0x40c00000, v86 :: v_dual_add_f32 v85, 0x40c00000, v85
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v64, 16, v29
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v65, 16, v28
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v1, v64, 16, v68
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v0, v55, 16, v69
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v11, v84, v96, vcc_lo
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v84, v86, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v87, 0x400000, v86
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v96, v12, 16, 1
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v86, v86
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v3, v3, 16, v66
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v55, 16, v30
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v84, v84, v86, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v86, v96, v12, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v96, v85, 16, 1
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 16, v27
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v64, 16, v29
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v84, v84, v87, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v87, 0x400000, v12
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v13, 0x40c00000, v13
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v29, v52, 16, v55
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v28, v51, 16, v64
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v27, v50, 16, v65
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v65, 16, v28
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v3, v3, 16, v66
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 16, v27
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v12, v86, v87, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v97, v13, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v86, v96, v85, 0x7fff
@@ -113154,9 +113189,9 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v97, v97, v13, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v98, 16, v15
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v50, 16, v25
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v51, 16, v24
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v52, 16, v23
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v29, v52, 16, v55
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v28, v51, 16, v64
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v27, v50, 16, v65
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_dual_add_f32 v98, 0x40c00000, v98 :: v_dual_add_f32 v15, 0x40c00000, v15
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v99, v96, 16, 1
@@ -113166,9 +113201,9 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v112, v15, 16, 1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v99, v99, v96, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v113, 0x400000, v98
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v7.l, v7.h
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v8.l, v8.h
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v9.l, v9.h
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v50, 16, v25
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v51, 16, v24
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v52, 16, v23
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v96, v99, v101, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v99, v102, v14, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v101, 0x400000, v14
@@ -113179,34 +113214,37 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v96, 16, v96
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v14, v99, v101, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v10.l, v10.h
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v83.l, v83.h
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v84, 16, v84
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v7.l, v7.h
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v8.l, v8.h
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v9.l, v9.h
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v12.l, v14.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v15, v103, v112, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v98, v98
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v10.l, v10.h
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v83.l, v83.h
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v84, 16, v84
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
+; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v98, v102, v113, vcc_lo
+; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v82, 16, v82
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v81, 16, v81
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v80, 16, v80
-; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v98, v102, v113, vcc_lo
-; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v71, 16, v71
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v30, v53, 16, v54
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v53, 16, v22
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v14, 16, v98
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v13, v97, v100, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v85, v85
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v13.l, v15.h
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v25, v48, 16, v49
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v24, v39, 16, v50
-; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v23, v38, 16, v51
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v71, 16, v71
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v30, v53, 16, v54
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v53, 16, v22
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v85, v86, v87, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v86.l, v12.h
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v87.l, v13.h
 ; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v15, v13, 16, v14
 ; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v14, v12, 16, v96
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v12, 16, v85
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v25, v48, 16, v49
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v24, v39, 16, v50
+; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v23, v38, 16, v51
 ; GFX11-TRUE16-NEXT:    v_lshl_or_b32 v22, v37, 16, v52
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v37, 16, v20
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v38, 16, v19
@@ -113293,15 +113331,15 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v33, v34, v36, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v34, 0x400000, v36
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v37, v17, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v35, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v36, v36
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v36, 0x400000, v35
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v33, v33, v34, vcc_lo
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v39, 0x40c00000, v18
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v34, v37, v35, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v37, v38, 16, 1
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v39, 0x40c00000, v18
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v17, v33, v17, 0x7060302
@@ -113508,14 +113546,12 @@ define <64 x i16> @bitcast_v64bf16_to_v64i16(<64 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v0, 0x40c00000, v0 :: v_dual_lshlrev_b32 v67, 16, v1
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v30, v54, v30, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v31, v55, v64, vcc_lo
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v67, 0x40c00000, v67
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v55, v65, v66, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v64, 0x400000, v66
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v65, v68, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v66, v66
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v67, 0x40c00000, v67
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v66, v0, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v55, v55, v64, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v64, v65, v68, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v65, 0x400000, v68
@@ -113795,30 +113831,29 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:100
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v4
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
@@ -113861,55 +113896,58 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v30
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v39
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:28
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:12
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:136
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:128
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v38
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(4)
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:116
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v53
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v52
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v55
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v51
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v54
-; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v53
-; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v52
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v51
+; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v50
+; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v49
+; GCN-NEXT:    v_lshlrev_b32_e32 v38, 16, v38
+; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
 ; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v50
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v35
-; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:116
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v36
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:108
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:96
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v35
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:104
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v12
 ; GCN-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:124
 ; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:120
 ; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:112
-; GCN-NEXT:    s_waitcnt vmcnt(4)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v18
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v20
 ; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v16
-; GCN-NEXT:    ; implicit-def: $vgpr20
 ; GCN-NEXT:    ; implicit-def: $vgpr18
+; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
+; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr16
 ; GCN-NEXT:    ; kill: killed $vgpr16
 ; GCN-NEXT:    ; implicit-def: $vgpr16
@@ -113920,7 +113958,6 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr16
 ; GCN-NEXT:    ; implicit-def: $vgpr16
 ; GCN-NEXT:    ; kill: killed $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr47
 ; GCN-NEXT:    ; implicit-def: $vgpr56
@@ -113946,70 +113983,70 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr16
 ; GCN-NEXT:    ; kill: killed $vgpr16
 ; GCN-NEXT:    ; implicit-def: $vgpr16
+; GCN-NEXT:    ; implicit-def: $vgpr20
 ; GCN-NEXT:    ; implicit-def: $vgpr22
 ; GCN-NEXT:    ; implicit-def: $vgpr24
 ; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB52_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v19
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v21
 ; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v23
 ; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v25
 ; GCN-NEXT:    v_lshlrev_b32_e32 v57, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v29
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v6
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v45
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v43
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v43
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v42
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v42
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v41
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v41
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v40
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v34
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v33
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v32
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v31
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v37
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v33
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v10
 ; GCN-NEXT:    s_waitcnt vmcnt(12)
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v12
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v8
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr3
 ; GCN-NEXT:    ; implicit-def: $vgpr5
@@ -114025,30 +114062,30 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr25
 ; GCN-NEXT:    ; implicit-def: $vgpr27
 ; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr34
-; GCN-NEXT:    ; implicit-def: $vgpr6
 ; GCN-NEXT:    ; implicit-def: $vgpr4
 ; GCN-NEXT:    ; implicit-def: $vgpr2
+; GCN-NEXT:    ; implicit-def: $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr34
+; GCN-NEXT:    ; implicit-def: $vgpr33
 ; GCN-NEXT:    ; implicit-def: $vgpr32
-; GCN-NEXT:    ; implicit-def: $vgpr8
 ; GCN-NEXT:    ; implicit-def: $vgpr31
-; GCN-NEXT:    ; implicit-def: $vgpr37
-; GCN-NEXT:    ; implicit-def: $vgpr33
+; GCN-NEXT:    ; implicit-def: $vgpr6
+; GCN-NEXT:    ; implicit-def: $vgpr10
 ; GCN-NEXT:    ; implicit-def: $vgpr12
 ; GCN-NEXT:    ; implicit-def: $vgpr14
-; GCN-NEXT:    ; implicit-def: $vgpr10
+; GCN-NEXT:    ; implicit-def: $vgpr8
 ; GCN-NEXT:  .LBB52_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB52_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v10
-; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
-; GCN-NEXT:    v_or_b32_e32 v10, v55, v10
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v8
+; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
+; GCN-NEXT:    v_or_b32_e32 v8, v55, v8
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v14
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
@@ -114057,23 +114094,23 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 3, v12
 ; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
 ; GCN-NEXT:    v_or_b32_e32 v12, v53, v12
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v33
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
-; GCN-NEXT:    v_or_b32_e32 v16, v52, v16
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, 3, v10
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
+; GCN-NEXT:    v_or_b32_e32 v10, v52, v10
 ; GCN-NEXT:    s_mov_b32 s6, 0x30000
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v37
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v31
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v8
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v32
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v6
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v31
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v32
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v33
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v34
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v40
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v41
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v42
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v43
 ; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v45
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v2
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v4
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, 3, v6
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v34
 ; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
 ; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v25
@@ -114090,19 +114127,19 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v1
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xffff, v32
+; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v33
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
-; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
-; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v33
 ; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
@@ -114118,31 +114155,31 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_or_b32_e32 v18, v51, v18
-; GCN-NEXT:    v_or_b32_e32 v20, v50, v20
-; GCN-NEXT:    v_or_b32_e32 v8, v49, v8
-; GCN-NEXT:    v_or_b32_e32 v22, v48, v22
-; GCN-NEXT:    v_or_b32_e32 v24, v39, v24
-; GCN-NEXT:    v_or_b32_e32 v26, v38, v26
+; GCN-NEXT:    v_or_b32_e32 v6, v51, v6
+; GCN-NEXT:    v_or_b32_e32 v16, v50, v16
+; GCN-NEXT:    v_or_b32_e32 v18, v49, v18
+; GCN-NEXT:    v_or_b32_e32 v20, v48, v20
+; GCN-NEXT:    v_or_b32_e32 v22, v39, v22
+; GCN-NEXT:    v_or_b32_e32 v24, v38, v24
+; GCN-NEXT:    v_or_b32_e32 v26, v30, v26
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v30, v28
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v31, v31, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v2, v32, v2
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v32, v32, v33
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v2, v33, v2
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v32, v32, v33
+; GCN-NEXT:    v_or_b32_e32 v4, v33, v4
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v33, v29
@@ -114185,26 +114222,26 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v33, v3
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v33, v1
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, 0x30000, v10
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 0x30000, v8
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, s6, v14
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, s6, v12
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, s6, v16
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, s6, v18
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, s6, v20
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, s6, v8
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, s6, v10
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, s6, v6
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, s6, v16
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, s6, v18
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, s6, v20
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, s6, v22
 ; GCN-NEXT:    v_add_i32_e32 v24, vcc, s6, v24
 ; GCN-NEXT:    v_add_i32_e32 v26, vcc, s6, v26
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, s6, v28
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, s6, v28
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, s6, v30
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, s6, v31
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, s6, v32
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, s6, v4
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, s6, v6
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, s6, v32
 ; GCN-NEXT:    v_add_i32_e32 v29, vcc, s6, v29
 ; GCN-NEXT:    v_add_i32_e32 v27, vcc, s6, v27
 ; GCN-NEXT:    v_add_i32_e32 v25, vcc, s6, v25
@@ -114221,12 +114258,12 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, s6, v3
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff0000, v1
-; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v1
+; GCN-NEXT:    buffer_store_dword v18, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v3
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v5
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
@@ -114238,37 +114275,37 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v9
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v9
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v11
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v13
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v15
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v17
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v19
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v19
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v19
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v21
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
@@ -114290,78 +114327,78 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v32
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v4
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v4
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v6
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v2
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v2
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v4
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v32
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v4
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v32
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v2
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v31
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v31
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v31
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v30
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v31
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v30
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff0000, v34
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v34
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v28
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff0000, v26
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v26
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:312 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v26
+; GCN-NEXT:    v_and_b32_e32 v38, 0xffff0000, v24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v24
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v24
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff0000, v22
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v22
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff0000, v20
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v20
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v8
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff0000, v33
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v33
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v34
+; GCN-NEXT:    v_and_b32_e32 v50, 0xffff0000, v16
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v34
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 16, v16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:332 ; 4-byte Folded Spill
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v33
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff0000, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v6
+; GCN-NEXT:    v_and_b32_e32 v52, 0xffff0000, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v10
 ; GCN-NEXT:    v_and_b32_e32 v53, 0xffff0000, v12
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v12
 ; GCN-NEXT:    v_and_b32_e32 v54, 0xffff0000, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v14
-; GCN-NEXT:    v_and_b32_e32 v55, 0xffff0000, v10
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v14
+; GCN-NEXT:    v_and_b32_e32 v55, 0xffff0000, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v8
 ; GCN-NEXT:  .LBB52_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v20
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v18
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
 ; GCN-NEXT:    v_alignbit_b32 v1, v1, v2, 16
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v18
+; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v37
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_alignbit_b32 v2, v2, v3, 16
 ; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
@@ -114374,36 +114411,34 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v35
 ; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GCN-NEXT:    v_alignbit_b32 v1, v1, v2, 16
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v45, v1, v2, 16
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v36
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v44, v2, v3, 16
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 8, v0
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v36
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_alignbit_b32 v44, v1, v2, 16
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 8, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v28
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_alignbit_b32 v15, v1, v2, 16
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 12, v0
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v15, v2, v3, 16
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 12, v0
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
-; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v11, v2, v3, 16
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 16, v0
+; GCN-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GCN-NEXT:    v_alignbit_b32 v1, v1, v2, 16
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 16, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
@@ -114414,7 +114449,7 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v3, 1.0, v3
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
@@ -114423,7 +114458,7 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
@@ -114432,9 +114467,11 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v45
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v4
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v45, v2, v4, 16
+; GCN-NEXT:    v_alignbit_b32 v11, v2, v4, 16
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 32, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -114456,48 +114493,48 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v56
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_alignbit_b32 v56, v2, v4, 16
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 44, v0
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 44, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v57
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_alignbit_b32 v57, v2, v4, 16
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 48, v0
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 48, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v58
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_alignbit_b32 v58, v2, v4, 16
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 52, v0
+; GCN-NEXT:    v_add_i32_e32 v35, vcc, 52, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v59
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v34, v2, v4, 16
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, 56, v0
+; GCN-NEXT:    v_alignbit_b32 v28, v2, v4, 16
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 56, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v60
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v33, v2, v4, 16
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 60, v0
+; GCN-NEXT:    v_alignbit_b32 v32, v2, v4, 16
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 60, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v61
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v32, v2, v4, 16
-; GCN-NEXT:    v_add_i32_e32 v37, vcc, 64, v0
+; GCN-NEXT:    v_alignbit_b32 v31, v2, v4, 16
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 64, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    v_mul_f32_e32 v4, 1.0, v62
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_alignbit_b32 v31, v2, v4, 16
+; GCN-NEXT:    v_alignbit_b32 v34, v2, v4, 16
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x44, v0
 ; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -114506,7 +114543,7 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_alignbit_b32 v41, v2, v4, 16
 ; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x48, v0
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_mul_f32_e32 v2, 1.0, v2
 ; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
@@ -114563,46 +114600,46 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_alignbit_b32 v16, v51, v16, 16
 ; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x68, v0
 ; GCN-NEXT:    v_mul_f32_e32 v52, 1.0, v52
-; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v22
+; GCN-NEXT:    v_mul_f32_e32 v20, 1.0, v20
 ; GCN-NEXT:    v_lshrrev_b32_e32 v52, 16, v52
-; GCN-NEXT:    v_alignbit_b32 v22, v52, v22, 16
+; GCN-NEXT:    v_alignbit_b32 v20, v52, v20, 16
 ; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x6c, v0
 ; GCN-NEXT:    v_mul_f32_e32 v53, 1.0, v53
-; GCN-NEXT:    v_mul_f32_e32 v24, 1.0, v24
+; GCN-NEXT:    v_mul_f32_e32 v22, 1.0, v22
 ; GCN-NEXT:    v_lshrrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_alignbit_b32 v24, v53, v24, 16
+; GCN-NEXT:    v_alignbit_b32 v22, v53, v22, 16
 ; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x70, v0
 ; GCN-NEXT:    v_mul_f32_e32 v54, 1.0, v54
-; GCN-NEXT:    v_mul_f32_e32 v26, 1.0, v26
+; GCN-NEXT:    v_mul_f32_e32 v24, 1.0, v24
 ; GCN-NEXT:    v_lshrrev_b32_e32 v54, 16, v54
-; GCN-NEXT:    v_alignbit_b32 v26, v54, v26, 16
+; GCN-NEXT:    v_alignbit_b32 v24, v54, v24, 16
 ; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x74, v0
 ; GCN-NEXT:    v_mul_f32_e32 v55, 1.0, v55
-; GCN-NEXT:    v_mul_f32_e32 v28, 1.0, v28
+; GCN-NEXT:    v_mul_f32_e32 v26, 1.0, v26
 ; GCN-NEXT:    v_lshrrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    v_alignbit_b32 v28, v55, v28, 16
+; GCN-NEXT:    v_alignbit_b32 v26, v55, v26, 16
 ; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
-; GCN-NEXT:    buffer_store_dword v1, v27, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v44, v23, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v15, v20, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v11, v18, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(3)
+; GCN-NEXT:    buffer_store_dword v45, v29, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v44, v25, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v15, v21, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v1, v18, s[0:3], 0 offen
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v1, v5, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v3, v9, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v7, v13, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v45, v17, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v11, v17, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v46, v19, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v47, v21, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v56, v25, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v57, v29, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v58, v35, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v34, v36, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v33, v37, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v32, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v31, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v47, v23, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v56, v27, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v57, v35, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v58, v36, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v37, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v32, v33, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v31, v40, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v34, v42, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v41, v43, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v2, v30, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v4, v38, s[0:3], 0 offen
@@ -114612,10 +114649,10 @@ define <64 x bfloat> @bitcast_v64i16_to_v64bf16(<64 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v12, v50, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v14, v51, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v16, v52, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v22, v53, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v24, v54, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v26, v55, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v28, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v20, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v54, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v24, v55, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v26, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Reload
@@ -114888,679 +114925,656 @@ define <64 x i16> @bitcast_v64f16_to_v64i16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:136
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:52
+; GCN-NEXT:    s_waitcnt expcnt(5)
+; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:48
 ; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:48
-; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:40
+; GCN-NEXT:    s_waitcnt expcnt(2)
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:36
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v3
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v7
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v10
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v12
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:32
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v15
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v9
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v10
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v11
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v13
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v14
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v15
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v19
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v20
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v18
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v20
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v22
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v23
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v24
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v24
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v26
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v27
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v28
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v29
+; GCN-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v30
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v29
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v30
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v45
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:120
-; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v41
-; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v53
-; GCN-NEXT:    s_waitcnt vmcnt(2) expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v15
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v13
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v12
-; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v42
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:136
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v13
+; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v11
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v54
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v52
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v51
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v59
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v49
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v48
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v39
+; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v36
+; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v57
+; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v63
+; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v62
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v61
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v43
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v40
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:112
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v55
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
+; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v59
+; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v58
+; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v56
+; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v45
+; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v46
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v42
+; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v41
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v50
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v11
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v12
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v9
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v13
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v29
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:132
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v30
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v9
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
+; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v14
+; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v49
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:132
+; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v35
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v11
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v13
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v30
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_or_saveexec_b64 s[4:5], s[4:5]
-; GCN-NEXT:    v_mov_b32_e32 v42, v56
-; GCN-NEXT:    v_mov_b32_e32 v49, v57
-; GCN-NEXT:    v_mov_b32_e32 v54, v58
-; GCN-NEXT:    v_mov_b32_e32 v51, v62
-; GCN-NEXT:    v_mov_b32_e32 v48, v4
-; GCN-NEXT:    v_mov_b32_e32 v36, v5
-; GCN-NEXT:    v_mov_b32_e32 v46, v6
+; GCN-NEXT:    v_mov_b32_e32 v49, v54
+; GCN-NEXT:    v_mov_b32_e32 v48, v37
+; GCN-NEXT:    v_mov_b32_e32 v38, v53
 ; GCN-NEXT:    s_xor_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB53_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.true
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
-; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v12
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v29
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
 ; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v13
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
+; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
-; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v17
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v14
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
+; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
+; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v19
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v18
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v30
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
+; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
+; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v17
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
+; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
+; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v20
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
-; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
 ; GCN-NEXT:    v_add_f32_e32 v22, 0x38000000, v22
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v22
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
 ; GCN-NEXT:    v_add_f32_e32 v25, 0x38000000, v25
 ; GCN-NEXT:    v_add_f32_e32 v24, 0x38000000, v24
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v25
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v24, v24, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v25
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
 ; GCN-NEXT:    v_add_f32_e32 v27, 0x38000000, v27
 ; GCN-NEXT:    v_add_f32_e32 v26, 0x38000000, v26
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v27
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v26
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v26, v26, v29
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v27
+; GCN-NEXT:    v_or_b32_e32 v26, v26, v30
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v29
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
+; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v29
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
 ; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v29
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v2
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v30
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
 ; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
 ; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v29
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v30
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v4, v5, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v6
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
-; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v10
+; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
 ; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v5, v7, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v31
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v46
-; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v30
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
+; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v29
+; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v31
-; GCN-NEXT:    v_or_b32_e32 v6, v10, v29
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v33
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v10
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v32
-; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v29
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v33
-; GCN-NEXT:    v_or_b32_e32 v32, v29, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v35
+; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v31
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v32
+; GCN-NEXT:    v_or_b32_e32 v31, v30, v31
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v34
-; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v29
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v35
-; GCN-NEXT:    v_or_b32_e32 v34, v29, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v63
+; GCN-NEXT:    v_add_f32_e32 v33, 0x38000000, v33
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v33
+; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v33, v30, v33
+; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v51
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v61
+; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v55
+; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v1
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v1
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v49
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v51
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v42, v42
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v52, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v58, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v61, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v1
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v63, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v40
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v62, v53
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v45
+; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v60, v60
-; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v41
-; GCN-NEXT:    v_cvt_f32_f16_e32 v59, v59
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v63, v63
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v57
+; GCN-NEXT:    v_cvt_f32_f16_e32 v62, v62
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v47, v47
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v50
+; GCN-NEXT:    v_cvt_f32_f16_e32 v61, v61
+; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v45
+; GCN-NEXT:    v_cvt_f32_f16_e32 v59, v59
+; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v44
+; GCN-NEXT:    v_cvt_f32_f16_e32 v58, v58
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v43, v43
-; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v39
-; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v40
-; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v55
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v56
+; GCN-NEXT:    v_cvt_f32_f16_e32 v42, v42
+; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v46
+; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v41
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
+; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
 ; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
 ; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
+; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
 ; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
+; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
+; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
 ; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
 ; GCN-NEXT:    v_add_f32_e32 v52, 0x38000000, v52
+; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
 ; GCN-NEXT:    v_add_f32_e32 v54, 0x38000000, v54
-; GCN-NEXT:    v_add_f32_e32 v42, 0x38000000, v42
-; GCN-NEXT:    v_add_f32_e32 v44, 0x38000000, v44
-; GCN-NEXT:    v_add_f32_e32 v46, 0x38000000, v46
-; GCN-NEXT:    v_add_f32_e32 v56, 0x38000000, v56
-; GCN-NEXT:    v_add_f32_e32 v58, 0x38000000, v58
-; GCN-NEXT:    v_add_f32_e32 v61, 0x38000000, v61
+; GCN-NEXT:    v_add_f32_e32 v55, 0x38000000, v55
+; GCN-NEXT:    v_add_f32_e32 v40, 0x38000000, v40
+; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v60, 0x38000000, v60
 ; GCN-NEXT:    v_add_f32_e32 v63, 0x38000000, v63
-; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
 ; GCN-NEXT:    v_add_f32_e32 v57, 0x38000000, v57
 ; GCN-NEXT:    v_add_f32_e32 v62, 0x38000000, v62
+; GCN-NEXT:    v_add_f32_e32 v47, 0x38000000, v47
+; GCN-NEXT:    v_add_f32_e32 v61, 0x38000000, v61
 ; GCN-NEXT:    v_add_f32_e32 v45, 0x38000000, v45
-; GCN-NEXT:    v_add_f32_e32 v60, 0x38000000, v60
-; GCN-NEXT:    v_add_f32_e32 v41, 0x38000000, v41
 ; GCN-NEXT:    v_add_f32_e32 v59, 0x38000000, v59
-; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
-; GCN-NEXT:    v_add_f32_e32 v47, 0x38000000, v47
-; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
+; GCN-NEXT:    v_add_f32_e32 v44, 0x38000000, v44
+; GCN-NEXT:    v_add_f32_e32 v58, 0x38000000, v58
 ; GCN-NEXT:    v_add_f32_e32 v43, 0x38000000, v43
-; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
-; GCN-NEXT:    v_add_f32_e32 v40, 0x38000000, v40
-; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
-; GCN-NEXT:    v_add_f32_e32 v55, 0x38000000, v55
-; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
+; GCN-NEXT:    v_add_f32_e32 v56, 0x38000000, v56
+; GCN-NEXT:    v_add_f32_e32 v42, 0x38000000, v42
+; GCN-NEXT:    v_add_f32_e32 v46, 0x38000000, v46
+; GCN-NEXT:    v_add_f32_e32 v41, 0x38000000, v41
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v49
+; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v53
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v54
-; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v42
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v56
-; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v58
-; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v61
+; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v55
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v40
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v63, v63
-; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v57
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v62
+; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v47
+; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v61
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
-; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v41
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v59
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v53
-; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v58
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v43
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v40
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
-; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v55
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v56
+; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v42
+; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v46
+; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v41
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
 ; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v36
-; GCN-NEXT:    v_lshlrev_b32_e32 v48, 16, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v51
+; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
+; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v52
 ; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v54
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v61
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v63, 16, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v62, 16, v62
-; GCN-NEXT:    v_lshlrev_b32_e32 v60, 16, v60
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v61
 ; GCN-NEXT:    v_lshlrev_b32_e32 v59, 16, v59
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v29
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v61, v38, v36
-; GCN-NEXT:    v_or_b32_e32 v49, v49, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
+; GCN-NEXT:    v_or_b32_e32 v35, v35, v30
+; GCN-NEXT:    buffer_store_dword v35, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v38, v38, v36
+; GCN-NEXT:    v_or_b32_e32 v48, v48, v39
+; GCN-NEXT:    v_or_b32_e32 v49, v49, v37
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v52, v51
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v42, v42, v54
+; GCN-NEXT:    v_or_b32_e32 v35, v51, v50
+; GCN-NEXT:    buffer_store_dword v35, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v44
-; GCN-NEXT:    v_mov_b32_e32 v46, v6
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v35, v53, v52
+; GCN-NEXT:    buffer_store_dword v35, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v58, v56
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v35, v55, v54
+; GCN-NEXT:    buffer_store_dword v35, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v40
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v63, v2
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v60, v60, v23
+; GCN-NEXT:    v_or_b32_e32 v57, v57, v63
+; GCN-NEXT:    v_or_b32_e32 v47, v47, v62
+; GCN-NEXT:    v_or_b32_e32 v45, v45, v61
+; GCN-NEXT:    v_or_b32_e32 v44, v44, v59
+; GCN-NEXT:    v_or_b32_e32 v43, v43, v58
+; GCN-NEXT:    v_or_b32_e32 v42, v42, v56
+; GCN-NEXT:    v_or_b32_e32 v41, v41, v46
+; GCN-NEXT:    v_alignbit_b32 v51, v33, v30, 16
+; GCN-NEXT:    v_alignbit_b32 v55, v31, v36, 16
+; GCN-NEXT:    v_alignbit_b32 v30, v9, v39, 16
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v57, v28
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v45, v45, v62
-; GCN-NEXT:    v_or_b32_e32 v41, v41, v60
+; GCN-NEXT:    v_alignbit_b32 v30, v7, v37, 16
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v30, v53, v59
-; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v50, v50, v47
-; GCN-NEXT:    v_or_b32_e32 v39, v39, v43
-; GCN-NEXT:    v_or_b32_e32 v37, v37, v40
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v55
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v30, v5, v50, 16
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    v_alignbit_b32 v63, v34, v29, 16
-; GCN-NEXT:    v_alignbit_b32 v36, v32, v36, 16
-; GCN-NEXT:    v_alignbit_b32 v48, v46, v48, 16
-; GCN-NEXT:    v_mov_b32_e32 v10, v5
-; GCN-NEXT:    v_alignbit_b32 v51, v5, v51, 16
-; GCN-NEXT:    v_mov_b32_e32 v7, v4
-; GCN-NEXT:    v_alignbit_b32 v54, v4, v54, 16
-; GCN-NEXT:    v_alignbit_b32 v29, v3, v44, 16
-; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(1) expcnt(0)
-; GCN-NEXT:    v_alignbit_b32 v29, v1, v56, 16
-; GCN-NEXT:    buffer_store_dword v29, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v2, v26, v2, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v28, v24, v28, 16
-; GCN-NEXT:    v_alignbit_b32 v53, v22, v62, 16
-; GCN-NEXT:    v_alignbit_b32 v60, v20, v60, 16
-; GCN-NEXT:    v_alignbit_b32 v59, v18, v59, 16
-; GCN-NEXT:    v_alignbit_b32 v47, v16, v47, 16
-; GCN-NEXT:    v_alignbit_b32 v43, v14, v43, 16
-; GCN-NEXT:    v_alignbit_b32 v40, v11, v40, 16
-; GCN-NEXT:    v_alignbit_b32 v55, v9, v55, 16
+; GCN-NEXT:    v_alignbit_b32 v30, v3, v52, 16
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(4) expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v30, v1, v54, 16
+; GCN-NEXT:    buffer_store_dword v30, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v40, v28, v40, 16
+; GCN-NEXT:    v_alignbit_b32 v23, v26, v23, 16
+; GCN-NEXT:    v_alignbit_b32 v63, v24, v63, 16
+; GCN-NEXT:    v_alignbit_b32 v62, v21, v62, 16
+; GCN-NEXT:    v_alignbit_b32 v61, v19, v61, 16
+; GCN-NEXT:    v_alignbit_b32 v59, v15, v59, 16
+; GCN-NEXT:    v_alignbit_b32 v58, v16, v58, 16
+; GCN-NEXT:    v_alignbit_b32 v56, v12, v56, 16
+; GCN-NEXT:    v_alignbit_b32 v46, v11, v46, 16
 ; GCN-NEXT:  .LBB53_2: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT:    v_mov_b32_e32 v30, v1
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    v_mov_b32_e32 v36, v1
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v63
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v35
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v51
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v30
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v33
+; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v33
 ; GCN-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 4, v0
-; GCN-NEXT:    buffer_store_dword v2, v1, s[0:3], 0 offen
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v61
+; GCN-NEXT:    buffer_store_dword v30, v1, s[0:3], 0 offen
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v38
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v36
-; GCN-NEXT:    v_or_b32_e32 v56, v1, v2
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v32
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v33
-; GCN-NEXT:    v_or_b32_e32 v44, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, 8, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v55
+; GCN-NEXT:    v_or_b32_e32 v30, v1, v30
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v31
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v32
+; GCN-NEXT:    v_mov_b32_e32 v39, v40
+; GCN-NEXT:    v_or_b32_e32 v40, v1, v31
+; GCN-NEXT:    v_add_i32_e32 v52, vcc, 8, v0
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v48
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v48, v1, v31
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, 12, v0
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v10
+; GCN-NEXT:    v_or_b32_e32 v9, v1, v9
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 16, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v48
-; GCN-NEXT:    v_or_b32_e32 v61, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v52, vcc, 12, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v46
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v31
-; GCN-NEXT:    v_or_b32_e32 v58, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v63, vcc, 16, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v51
-; GCN-NEXT:    v_or_b32_e32 v57, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v46, vcc, 20, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v10
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v10, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 24, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v42
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v54
-; GCN-NEXT:    v_or_b32_e32 v31, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 28, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v10
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v38, vcc, 20, v0
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v7
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v7
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 24, v0
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v7, v1, v2
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
+; GCN-NEXT:    v_or_b32_e32 v10, v1, v7
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 28, v0
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v6
+; GCN-NEXT:    v_or_b32_e32 v7, v1, v5
 ; GCN-NEXT:    v_add_i32_e32 v6, vcc, 32, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v33, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 36, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
+; GCN-NEXT:    v_or_b32_e32 v32, v1, v5
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 36, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v5, v1, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v5, v1, v3
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 40, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v35, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 44, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v30
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v34, v1, v3
+; GCN-NEXT:    v_add_i32_e32 v35, vcc, 44, v0
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v36
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GCN-NEXT:    v_or_b32_e32 v3, v1, v2
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 48, v0
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v30
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, 52, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v39
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v36
+; GCN-NEXT:    v_add_i32_e32 v36, vcc, 52, v0
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 56, v0
+; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v60
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v37, v23
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 60, v0
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v27
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 56, v0
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v28
-; GCN-NEXT:    v_or_b32_e32 v28, v30, v28
-; GCN-NEXT:    v_add_i32_e32 v48, vcc, 60, v0
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 64, v0
+; GCN-NEXT:    v_and_b32_e32 v39, 0xffff, v57
+; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v63
+; GCN-NEXT:    v_or_b32_e32 v57, v39, v49
+; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x44, v0
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
 ; GCN-NEXT:    v_or_b32_e32 v24, v24, v25
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 64, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v45
-; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v53
-; GCN-NEXT:    v_or_b32_e32 v49, v30, v49
-; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x44, v0
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v23
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 0x48, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v41
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v60
-; GCN-NEXT:    v_or_b32_e32 v45, v30, v54
-; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x4c, v0
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v21
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 0x50, v0
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v59
-; GCN-NEXT:    v_or_b32_e32 v53, v30, v53
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x54, v0
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v19
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 0x58, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v50
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v47
-; GCN-NEXT:    v_or_b32_e32 v50, v30, v50
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x5c, v0
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 0x48, v0
+; GCN-NEXT:    v_and_b32_e32 v49, 0xffff, v47
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v62
+; GCN-NEXT:    v_or_b32_e32 v47, v49, v51
+; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x4c, v0
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v22
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 0x50, v0
+; GCN-NEXT:    v_and_b32_e32 v51, 0xffff, v45
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v61
+; GCN-NEXT:    v_or_b32_e32 v60, v51, v53
+; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x54, v0
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v20
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 0x58, v0
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v59
+; GCN-NEXT:    v_or_b32_e32 v59, v53, v55
+; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x5c, v0
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v17
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v17
 ; GCN-NEXT:    v_add_i32_e32 v17, vcc, 0x60, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v39
-; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v43
-; GCN-NEXT:    v_or_b32_e32 v39, v30, v39
-; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x64, v0
-; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
-; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v15
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, 0x68, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v37
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v40
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v37
-; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x6c, v0
+; GCN-NEXT:    v_and_b32_e32 v55, 0xffff, v43
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v58
+; GCN-NEXT:    v_or_b32_e32 v45, v55, v43
+; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x64, v0
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v18
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 0x68, v0
+; GCN-NEXT:    v_and_b32_e32 v42, 0xffff, v42
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v56
+; GCN-NEXT:    v_or_b32_e32 v44, v42, v43
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x6c, v0
+; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
+; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
+; GCN-NEXT:    v_or_b32_e32 v12, v12, v14
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, 0x70, v0
+; GCN-NEXT:    v_and_b32_e32 v41, 0xffff, v41
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v46
+; GCN-NEXT:    v_or_b32_e32 v41, v41, v43
+; GCN-NEXT:    v_add_i32_e32 v43, vcc, 0x74, v0
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GCN-NEXT:    v_or_b32_e32 v11, v11, v13
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 0x70, v0
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
-; GCN-NEXT:    v_or_b32_e32 v37, v37, v55
-; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x74, v0
-; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v12
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, 0x78, v0
+; GCN-NEXT:    v_add_i32_e32 v13, vcc, 0x78, v0
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x7c, v0
-; GCN-NEXT:    buffer_store_dword v56, v29, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v44, v52, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v61, v63, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v58, v46, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v57, v8, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v10, v32, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v31, v6, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v7, v34, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v33, v4, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v5, v36, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v35, v2, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v3, v38, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v1, v27, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v26, v48, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v28, v25, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v24, v51, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v49, v23, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v22, v54, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v45, v21, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v20, v41, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v53, v19, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v18, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v50, v17, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v16, v43, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v39, v15, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v14, v40, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v30, v13, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v11, v55, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v37, v12, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v30, v52, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v40, v50, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v48, v54, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v9, v38, s[0:3], 0 offen
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v9, v8, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    buffer_store_dword v8, v31, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v10, v6, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v7, v33, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v32, v4, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v5, v35, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v34, v2, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v36, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v1, v29, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v37, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v23, v27, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v26, v39, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v57, v25, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v24, v49, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v47, v22, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v21, v51, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v60, v20, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v19, v53, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v59, v17, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v15, v55, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v45, v18, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v16, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v44, v14, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v12, v43, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v41, v13, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:140 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:144 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:148 ; 4-byte Folded Reload
@@ -115571,10 +115585,13 @@ define <64 x i16> @bitcast_v64f16_to_v64i16(<64 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:168 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:172 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:176 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(5)
 ; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:180 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(3)
 ; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:184 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(1)
 ; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
index b040e77125770..fa9772f9702bb 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
@@ -9531,22 +9531,23 @@ define <8 x i16> @bitcast_v8bf16_to_v8i16(<8 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v5, v11, v12 :: v_dual_add_f32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v7, v13, v1, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v3
-; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v11, v6, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v9, v9, v0, 0x7fff
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v1, v7, v8 :: v_dual_add_f32 v2, 0x40c00000, v2
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v7, 0x40c00000, v12
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v8, v11, v6, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v11, 0x400000, v6
-; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v12, v2, 16, 1
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v13, v7, 16, 1
+; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v14, v3, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v15, 0x400000, v2
+; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v5, v1, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v6, v8, v11, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v8, v12, v2, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v11, v13, v7, 0x7fff
@@ -9554,18 +9555,18 @@ define <8 x i16> @bitcast_v8bf16_to_v8i16(<8 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v13, v14, v3, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v14, 0x400000, v3
-; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v5, v1, 0x7060302
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v7, v11, v12, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v3, v13, v14, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v3, v7, v3, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v2, v8, v15, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v2, v6, v2, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v9, v10, vcc_lo
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v0, v4, v0, 0x7060302
 ; GFX11-FAKE16-NEXT:  .LBB47_2: ; %end
 ; GFX11-FAKE16-NEXT:    s_or_b32 exec_lo, exec_lo, s0
@@ -11090,22 +11091,23 @@ define <8 x half> @bitcast_v8bf16_to_v8f16(<8 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v5, v11, v12 :: v_dual_add_f32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v7, v13, v1, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v3
-; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v11, v6, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v9, v9, v0, 0x7fff
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v1, v7, v8 :: v_dual_add_f32 v2, 0x40c00000, v2
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v7, 0x40c00000, v12
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v8, v11, v6, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v11, 0x400000, v6
-; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v12, v2, 16, 1
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v13, v7, 16, 1
+; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v14, v3, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v15, 0x400000, v2
+; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v5, v1, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v6, v8, v11, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v8, v12, v2, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v11, v13, v7, 0x7fff
@@ -11113,18 +11115,18 @@ define <8 x half> @bitcast_v8bf16_to_v8f16(<8 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v13, v14, v3, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v14, 0x400000, v3
-; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v5, v1, 0x7060302
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v7, v11, v12, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v3, v13, v14, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v3, v7, v3, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v2, v8, v15, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v2, v6, v2, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v9, v10, vcc_lo
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v0, v4, v0, 0x7060302
 ; GFX11-FAKE16-NEXT:  .LBB51_2: ; %end
 ; GFX11-FAKE16-NEXT:    s_or_b32 exec_lo, exec_lo, s0
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
index edeb780d481c4..0b5274e6b5050 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
@@ -20545,19 +20545,19 @@ define <32 x i8> @bitcast_v16bf16_to_v32i8(<16 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v9, v15, v10, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v15, 0x400000, v10
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v10, v19, v14, 0x7fff
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v20, v9, v15 :: v_dual_add_f32 v9, 0x40c00000, v17
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v17, 0xffff0000, v33
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v18, 0x400000, v12
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v15, 0x400000, v14
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v12, v16, v18, vcc_lo
+; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v12, v16, v18 :: v_dual_and_b32 v17, 0xffff0000, v33
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v14, v14
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v16, v9, 16, 1
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v18, 16, v32
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v14, v10, v15 :: v_dual_add_f32 v15, 0x40c00000, v17
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v17, 0x40c00000, v18
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v16, v16, v9, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v18, 0x400000, v9
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
index 6e6e62c4b05ad..61ec728557172 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
@@ -2005,13 +2005,13 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_mov_b32_e32 v33, v4
 ; VI-NEXT:    v_mov_b32_e32 v32, v2
 ; VI-NEXT:    v_mov_b32_e32 v31, v0
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; VI-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:16
 ; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:20
 ; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:4
@@ -2031,18 +2031,18 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
 ; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v15, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v2
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v4
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v11, 8, v4
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b16_e32 v11, 8, v8
-; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
+; VI-NEXT:    v_lshlrev_b16_e32 v13, 8, v6
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v13, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v15, 8, v44
+; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB5_2
@@ -2217,13 +2217,13 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_mov_b32_e32 v33, v4
 ; GFX9-NEXT:    v_mov_b32_e32 v32, v2
 ; GFX9-NEXT:    v_mov_b32_e32 v31, v0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; GFX9-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:16
 ; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:20
 ; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:4
@@ -2243,18 +2243,18 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v15, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v2
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v11, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v11, 8, v8
-; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
+; GFX9-NEXT:    v_lshlrev_b16_e32 v13, 8, v6
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v13, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v15, 8, v44
+; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB5_2
@@ -5205,13 +5205,13 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_mov_b32_e32 v33, v4
 ; VI-NEXT:    v_mov_b32_e32 v32, v2
 ; VI-NEXT:    v_mov_b32_e32 v31, v0
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; VI-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:16
 ; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:20
 ; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:4
@@ -5231,18 +5231,18 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
 ; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v15, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v2
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v4
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v11, 8, v4
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b16_e32 v11, 8, v8
-; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
+; VI-NEXT:    v_lshlrev_b16_e32 v13, 8, v6
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v13, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v15, 8, v44
+; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB13_2
@@ -5417,13 +5417,13 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_mov_b32_e32 v33, v4
 ; GFX9-NEXT:    v_mov_b32_e32 v32, v2
 ; GFX9-NEXT:    v_mov_b32_e32 v31, v0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; GFX9-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:16
 ; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:20
 ; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:4
@@ -5443,18 +5443,18 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v15, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v2
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v11, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v11, 8, v8
-; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
+; GFX9-NEXT:    v_lshlrev_b16_e32 v13, 8, v6
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v13, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v15, 8, v44
+; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB13_2
@@ -8098,13 +8098,13 @@ define <20 x half> @bitcast_v40i8_to_v20f16(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_mov_b32_e32 v38, v4
 ; VI-NEXT:    v_mov_b32_e32 v32, v2
 ; VI-NEXT:    v_mov_b32_e32 v36, v0
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:16
 ; VI-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:20
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:4
@@ -8126,17 +8126,17 @@ define <20 x half> @bitcast_v40i8_to_v20f16(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v27
 ; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v44, 8, v2
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v4
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v4
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v8
+; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v6
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v44, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
+; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -8321,13 +8321,13 @@ define <20 x half> @bitcast_v40i8_to_v20f16(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_mov_b32_e32 v35, v4
 ; GFX9-NEXT:    v_mov_b32_e32 v33, v2
 ; GFX9-NEXT:    v_mov_b32_e32 v36, v0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; GFX9-NEXT:    buffer_load_ushort v54, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:16
 ; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:20
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:4
@@ -8349,17 +8349,17 @@ define <20 x half> @bitcast_v40i8_to_v20f16(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v2
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
+; GFX9-NEXT:    v_lshlrev_b16_e32 v47, 8, v6
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v47, 8, v10
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v10
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -10300,13 +10300,13 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_mov_b32_e32 v33, v4
 ; VI-NEXT:    v_mov_b32_e32 v32, v2
 ; VI-NEXT:    v_mov_b32_e32 v31, v0
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:16
 ; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
 ; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:4
@@ -10328,17 +10328,17 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v27
 ; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v2
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v4
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v4
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
+; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v6
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v10
+; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v10
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -10523,13 +10523,13 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_mov_b32_e32 v33, v4
 ; GFX9-NEXT:    v_mov_b32_e32 v32, v2
 ; GFX9-NEXT:    v_mov_b32_e32 v31, v0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:16
 ; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
 ; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:4
@@ -10551,17 +10551,17 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v2
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
+; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v6
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v10
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v10
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -12619,13 +12619,13 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_mov_b32_e32 v33, v4
 ; VI-NEXT:    v_mov_b32_e32 v32, v2
 ; VI-NEXT:    v_mov_b32_e32 v31, v0
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:16
 ; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
 ; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:12
 ; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:4
@@ -12647,17 +12647,17 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v27
 ; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v0
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v2
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v4
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v4
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
+; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v6
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v10
+; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v10
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -12842,13 +12842,13 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_mov_b32_e32 v33, v4
 ; GFX9-NEXT:    v_mov_b32_e32 v32, v2
 ; GFX9-NEXT:    v_mov_b32_e32 v31, v0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:32
 ; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:16
 ; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
 ; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:12
 ; GFX9-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:4
@@ -12870,17 +12870,17 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v27
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v29
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v2
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v8
+; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v6
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v10
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v10
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
index c48a8459fdc3c..205620458bdac 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
@@ -3506,15 +3506,15 @@ define <16 x i32> @bitcast_v32bf16_to_v16i32(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v18, v22, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v18, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v17, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_add_f32 v19, 0x40c00000, v21
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v19, 0x40c00000, v21
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v22, v18, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v16, v19, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v20, 0x400000, v19
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
@@ -5370,109 +5370,109 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v32, v4
 ; GCN-NEXT:    v_mov_b32_e32 v35, v2
 ; GCN-NEXT:    v_mov_b32_e32 v31, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v52
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v1
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v17, 8, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 8, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v45
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 8, v44
-; GCN-NEXT:    s_waitcnt vmcnt(10)
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 24, v59
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v58
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 24, v57
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 24, v43
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v14
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v43
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v41
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v10
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v2
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v46, 24, v46
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v1
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v11
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v7
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -5480,13 +5480,13 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v31
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v42
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v40
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v32
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v41
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v55
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v33
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v40
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v54
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v34
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v55
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v53
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v35
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v36
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v37
@@ -5499,132 +5499,133 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v54
-; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v53
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v52
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v49
-; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v48
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v39
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v63
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v62
-; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v61
-; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
+; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v52
+; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v50
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v49
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v39
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v63
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v62
+; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v61
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v23
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v17, v33, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v34
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v33, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v26, v28, v29
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v46
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_or_b32_e32 v27, v28, v45
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v27, v35, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v29, v36, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v31, v38, v45
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v59
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v35
+; GCN-NEXT:    v_or_b32_e32 v30, v36, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_or_b32_e32 v32, v38, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v56
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v32
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v32
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v19, v7
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
+; GCN-NEXT:    v_or_b32_e32 v6, v19, v6
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v19, v5
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v4
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
+; GCN-NEXT:    v_or_b32_e32 v23, v23, v25
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v25, v22
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v26
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v27
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_or_b32_e32 v25, v59, v25
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v57, v30
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v32, v58, v8
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v28
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v30
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v32
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v32, v32, v8
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v9
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v8, v10
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v8, v12
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v8, v14
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v8, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v46, v19
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v8, v17
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v7
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v6
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v5
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v17, v21
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v23
 ; GCN-NEXT:    v_or_b32_e32 v5, v20, v22
-; GCN-NEXT:    v_or_b32_e32 v6, v23, v24
-; GCN-NEXT:    v_or_b32_e32 v7, v26, v28
-; GCN-NEXT:    v_or_b32_e32 v8, v27, v25
+; GCN-NEXT:    v_or_b32_e32 v6, v24, v25
+; GCN-NEXT:    v_or_b32_e32 v7, v26, v27
+; GCN-NEXT:    v_or_b32_e32 v8, v21, v28
 ; GCN-NEXT:    v_or_b32_e32 v9, v29, v30
 ; GCN-NEXT:    v_or_b32_e32 v10, v31, v32
 ; GCN-NEXT:    v_or_b32_e32 v11, v33, v34
 ; GCN-NEXT:    v_or_b32_e32 v12, v35, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v16
-; GCN-NEXT:    v_or_b32_e32 v15, v18, v19
+; GCN-NEXT:    v_or_b32_e32 v15, v18, v17
 ; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -5641,12 +5642,10 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr26
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr21
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr39
@@ -5654,57 +5653,59 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr27
 ; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; kill: killed $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; kill: killed $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; kill: killed $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:  .LBB13_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB13_4
@@ -5712,16 +5713,16 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v42, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v40, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v41, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v55, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v40, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v54, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v55, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v53, v3
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v35
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
@@ -5736,29 +5737,30 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v26
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v28
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v30
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v25
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v54
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v53
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v21
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v51
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v50
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v49
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v48
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v39
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v49
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v48
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v39
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v61
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v19
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v51
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v17
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v8
+; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v8
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
@@ -5769,59 +5771,51 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v4
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
+; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v6, v17, v35
+; GCN-NEXT:    v_or_b32_e32 v6, v56, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
+; GCN-NEXT:    v_or_b32_e32 v9, v47, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
+; GCN-NEXT:    v_or_b32_e32 v11, v46, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v13, v29, v14
+; GCN-NEXT:    v_or_b32_e32 v13, v45, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v15, v27, v16
+; GCN-NEXT:    v_or_b32_e32 v15, v59, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v18
-; GCN-NEXT:    v_or_b32_e32 v17, v44, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v20, v45, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v56, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v30
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v29, v43, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v19, v47, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v18, v58, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v21, v57, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v26, v41, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v25, v43, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_or_b32_e32 v27, v44, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v34
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -5851,29 +5845,37 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s7, v15
-; GCN-NEXT:    v_or_b32_e32 v16, v59, v16
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, s7, v17
-; GCN-NEXT:    v_or_b32_e32 v18, v57, v18
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
-; GCN-NEXT:    v_or_b32_e32 v22, v58, v22
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, s7, v23
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, s7, v18
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, s7, v27
+; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
+; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 0x300, v19
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v30
+; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, s7, v26
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, s7, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x300, v27
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
@@ -5883,13 +5885,13 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_or_b32_e32 v0, v31, v0
 ; GCN-NEXT:    v_or_b32_e32 v1, v8, v1
 ; GCN-NEXT:    v_or_b32_e32 v2, v5, v2
@@ -5899,13 +5901,13 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v6, v12, v11
 ; GCN-NEXT:    v_or_b32_e32 v7, v14, v13
 ; GCN-NEXT:    v_or_b32_e32 v8, v16, v15
-; GCN-NEXT:    v_or_b32_e32 v9, v18, v17
-; GCN-NEXT:    v_or_b32_e32 v10, v22, v20
-; GCN-NEXT:    v_or_b32_e32 v11, v24, v23
-; GCN-NEXT:    v_or_b32_e32 v12, v26, v25
-; GCN-NEXT:    v_or_b32_e32 v13, v28, v27
-; GCN-NEXT:    v_or_b32_e32 v14, v21, v29
-; GCN-NEXT:    v_or_b32_e32 v15, v30, v19
+; GCN-NEXT:    v_or_b32_e32 v9, v20, v18
+; GCN-NEXT:    v_or_b32_e32 v10, v22, v21
+; GCN-NEXT:    v_or_b32_e32 v11, v24, v19
+; GCN-NEXT:    v_or_b32_e32 v12, v28, v26
+; GCN-NEXT:    v_or_b32_e32 v13, v23, v30
+; GCN-NEXT:    v_or_b32_e32 v14, v29, v25
+; GCN-NEXT:    v_or_b32_e32 v15, v17, v27
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -5978,38 +5980,31 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; VI-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; VI-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -6020,38 +6015,52 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
+; VI-NEXT:    s_waitcnt vmcnt(9)
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
+; VI-NEXT:    s_waitcnt vmcnt(8)
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; VI-NEXT:    s_waitcnt vmcnt(7)
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; VI-NEXT:    s_waitcnt vmcnt(6)
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; VI-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -6062,28 +6071,28 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr47
-; VI-NEXT:    ; implicit-def: $vgpr42
-; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr46
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr27
-; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr21
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -6110,18 +6119,18 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr17
 ; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr38
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr30
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -6151,23 +6160,23 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -6201,24 +6210,24 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr48
-; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr25
-; VI-NEXT:    ; implicit-def: $vgpr21
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:  .LBB13_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB13_4
@@ -6229,27 +6238,27 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v15, 0x300
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v51
-; VI-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v9, 3, v40
+; VI-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_add_u16_e32 v10, 3, v24
+; VI-NEXT:    v_add_u16_e32 v10, 3, v49
 ; VI-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v11, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v23
+; VI-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v12, 3, v60
-; VI-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_add_u16_e32 v12, 3, v38
+; VI-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
-; VI-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v58
+; VI-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v14, 3, v42
-; VI-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v14, 3, v45
+; VI-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
-; VI-NEXT:    v_or_b32_sdwa v17, v57, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v17, v60, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -6285,18 +6294,18 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
-; VI-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v53
-; VI-NEXT:    v_or_b32_sdwa v16, v23, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -6335,39 +6344,39 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v52
+; VI-NEXT:    v_add_u16_e32 v8, 3, v43
 ; VI-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v50
+; VI-NEXT:    v_add_u16_e32 v9, 3, v52
 ; VI-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
-; VI-NEXT:    v_add_u16_e32 v10, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v10, 3, v29
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v63
-; VI-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v12, 3, v61
+; VI-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v12, v12, v13
-; VI-NEXT:    v_add_u16_e32 v13, 3, v44
-; VI-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v39
+; VI-NEXT:    v_add_u16_e32 v14, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v16
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v19
-; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v27, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
 ; VI-NEXT:  .LBB13_4: ; %end
@@ -6426,39 +6435,32 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -6468,50 +6470,57 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; GFX9-NEXT:    s_waitcnt vmcnt(6)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; GFX9-NEXT:    s_waitcnt vmcnt(3)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; GFX9-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -6521,29 +6530,29 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr38
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr45
+; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
-; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr54
-; GFX9-NEXT:    ; implicit-def: $vgpr27
-; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr50
+; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -6570,18 +6579,18 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr38
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr30
+; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -6611,23 +6620,23 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -6661,24 +6670,24 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    ; implicit-def: $vgpr52
-; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr49
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr51
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr25
-; GFX9-NEXT:    ; implicit-def: $vgpr21
+; GFX9-NEXT:    ; implicit-def: $vgpr27
 ; GFX9-NEXT:  .LBB13_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB13_4
@@ -6688,28 +6697,28 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v51
-; GFX9-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v40
+; GFX9-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v24
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v30
-; GFX9-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v23
+; GFX9-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v60
-; GFX9-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
-; GFX9-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v58
+; GFX9-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v13, v13, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
-; GFX9-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v45
+; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v53
-; GFX9-NEXT:    v_or_b32_sdwa v15, v23, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
@@ -6746,18 +6755,18 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
-; GFX9-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v17
-; GFX9-NEXT:    v_or_b32_sdwa v16, v57, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v16, v60, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -6795,39 +6804,39 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v43
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v50
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v52
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
-; GFX9-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v29
+; GFX9-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v63
-; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v26
+; GFX9-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v48
-; GFX9-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v61
+; GFX9-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_or_b32_e32 v12, v12, v13
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v19
-; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v27, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GFX9-NEXT:  .LBB13_4: ; %end
@@ -11012,15 +11021,15 @@ define <16 x float> @bitcast_v32bf16_to_v16f32(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v18, v22, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v18, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v17, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_add_f32 v19, 0x40c00000, v21
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v19, 0x40c00000, v21
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v22, v18, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v16, v19, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v20, 0x400000, v19
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
@@ -12860,109 +12869,109 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v32, v4
 ; GCN-NEXT:    v_mov_b32_e32 v35, v2
 ; GCN-NEXT:    v_mov_b32_e32 v31, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v52
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v1
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v17, 8, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 8, v29
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v45
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 8, v44
-; GCN-NEXT:    s_waitcnt vmcnt(10)
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 24, v59
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v58
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 24, v57
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 24, v43
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v43
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v41
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v10
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v2
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v46, 24, v46
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v1
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v11
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v7
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -12970,13 +12979,13 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v31
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v42
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v40
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v32
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v41
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v55
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v33
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v40
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v54
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v34
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v55
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v53
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v35
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v36
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v37
@@ -12989,132 +12998,133 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v54
-; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v53
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v52
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v49
-; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v48
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v39
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v63
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v62
-; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v61
-; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
+; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v52
+; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v50
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v49
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v39
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v63
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v62
+; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v61
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v23
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v17, v33, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v34
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v33, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v26, v28, v29
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v46
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_or_b32_e32 v27, v28, v45
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v27, v35, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v29, v36, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v31, v38, v45
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v59
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v35
+; GCN-NEXT:    v_or_b32_e32 v30, v36, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_or_b32_e32 v32, v38, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v56
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v32
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v32
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v19, v7
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
+; GCN-NEXT:    v_or_b32_e32 v6, v19, v6
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v19, v5
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v4
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
+; GCN-NEXT:    v_or_b32_e32 v23, v23, v25
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v25, v22
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v26
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v27
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_or_b32_e32 v25, v59, v25
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v57, v30
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v32, v58, v8
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v28
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v30
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v32
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v32, v32, v8
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v9
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v8, v10
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v8, v12
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v8, v14
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v8, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v46, v19
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v8, v17
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v7
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v6
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v5
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v17, v21
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v23
 ; GCN-NEXT:    v_or_b32_e32 v5, v20, v22
-; GCN-NEXT:    v_or_b32_e32 v6, v23, v24
-; GCN-NEXT:    v_or_b32_e32 v7, v26, v28
-; GCN-NEXT:    v_or_b32_e32 v8, v27, v25
+; GCN-NEXT:    v_or_b32_e32 v6, v24, v25
+; GCN-NEXT:    v_or_b32_e32 v7, v26, v27
+; GCN-NEXT:    v_or_b32_e32 v8, v21, v28
 ; GCN-NEXT:    v_or_b32_e32 v9, v29, v30
 ; GCN-NEXT:    v_or_b32_e32 v10, v31, v32
 ; GCN-NEXT:    v_or_b32_e32 v11, v33, v34
 ; GCN-NEXT:    v_or_b32_e32 v12, v35, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v16
-; GCN-NEXT:    v_or_b32_e32 v15, v18, v19
+; GCN-NEXT:    v_or_b32_e32 v15, v18, v17
 ; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -13131,12 +13141,10 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr26
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr21
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr39
@@ -13144,57 +13152,59 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr27
 ; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; kill: killed $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; kill: killed $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; kill: killed $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:  .LBB25_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB25_4
@@ -13202,16 +13212,16 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v42, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v40, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v41, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v55, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v40, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v54, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v55, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v53, v3
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v35
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
@@ -13226,29 +13236,30 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v26
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v28
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v30
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v25
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v54
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v53
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v21
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v51
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v50
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v49
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v48
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v39
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v49
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v48
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v39
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v61
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v19
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v51
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v17
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v8
+; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v8
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
@@ -13259,59 +13270,51 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v4
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
+; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v6, v17, v35
+; GCN-NEXT:    v_or_b32_e32 v6, v56, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
+; GCN-NEXT:    v_or_b32_e32 v9, v47, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
+; GCN-NEXT:    v_or_b32_e32 v11, v46, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v13, v29, v14
+; GCN-NEXT:    v_or_b32_e32 v13, v45, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v15, v27, v16
+; GCN-NEXT:    v_or_b32_e32 v15, v59, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v18
-; GCN-NEXT:    v_or_b32_e32 v17, v44, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v20, v45, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v56, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v30
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v29, v43, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v19, v47, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v18, v58, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v21, v57, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v26, v41, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v25, v43, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_or_b32_e32 v27, v44, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v34
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -13341,29 +13344,37 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s7, v15
-; GCN-NEXT:    v_or_b32_e32 v16, v59, v16
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, s7, v17
-; GCN-NEXT:    v_or_b32_e32 v18, v57, v18
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
-; GCN-NEXT:    v_or_b32_e32 v22, v58, v22
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, s7, v23
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, s7, v18
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, s7, v27
+; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
+; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 0x300, v19
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v30
+; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, s7, v26
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, s7, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x300, v27
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
@@ -13373,13 +13384,13 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_or_b32_e32 v0, v31, v0
 ; GCN-NEXT:    v_or_b32_e32 v1, v8, v1
 ; GCN-NEXT:    v_or_b32_e32 v2, v5, v2
@@ -13389,13 +13400,13 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v6, v12, v11
 ; GCN-NEXT:    v_or_b32_e32 v7, v14, v13
 ; GCN-NEXT:    v_or_b32_e32 v8, v16, v15
-; GCN-NEXT:    v_or_b32_e32 v9, v18, v17
-; GCN-NEXT:    v_or_b32_e32 v10, v22, v20
-; GCN-NEXT:    v_or_b32_e32 v11, v24, v23
-; GCN-NEXT:    v_or_b32_e32 v12, v26, v25
-; GCN-NEXT:    v_or_b32_e32 v13, v28, v27
-; GCN-NEXT:    v_or_b32_e32 v14, v21, v29
-; GCN-NEXT:    v_or_b32_e32 v15, v30, v19
+; GCN-NEXT:    v_or_b32_e32 v9, v20, v18
+; GCN-NEXT:    v_or_b32_e32 v10, v22, v21
+; GCN-NEXT:    v_or_b32_e32 v11, v24, v19
+; GCN-NEXT:    v_or_b32_e32 v12, v28, v26
+; GCN-NEXT:    v_or_b32_e32 v13, v23, v30
+; GCN-NEXT:    v_or_b32_e32 v14, v29, v25
+; GCN-NEXT:    v_or_b32_e32 v15, v17, v27
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -13468,38 +13479,31 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; VI-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; VI-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -13510,38 +13514,52 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; VI-NEXT:    s_waitcnt vmcnt(7)
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; VI-NEXT:    s_waitcnt vmcnt(6)
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; VI-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
+; VI-NEXT:    s_waitcnt vmcnt(9)
+; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
+; VI-NEXT:    s_waitcnt vmcnt(8)
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -13552,28 +13570,28 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr47
-; VI-NEXT:    ; implicit-def: $vgpr42
-; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr46
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr27
-; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr21
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -13600,18 +13618,18 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr17
 ; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr38
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr30
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -13641,23 +13659,23 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -13691,24 +13709,24 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr48
-; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr25
-; VI-NEXT:    ; implicit-def: $vgpr21
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:  .LBB25_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB25_4
@@ -13719,27 +13737,27 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v15, 0x300
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v51
-; VI-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v9, 3, v40
+; VI-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_add_u16_e32 v10, 3, v24
+; VI-NEXT:    v_add_u16_e32 v10, 3, v49
 ; VI-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v11, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v23
+; VI-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v12, 3, v60
-; VI-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_add_u16_e32 v12, 3, v38
+; VI-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
-; VI-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v58
+; VI-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v14, 3, v42
-; VI-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v14, 3, v45
+; VI-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
-; VI-NEXT:    v_or_b32_sdwa v17, v57, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v17, v60, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -13775,18 +13793,18 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
-; VI-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v53
-; VI-NEXT:    v_or_b32_sdwa v16, v23, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -13825,39 +13843,39 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v52
+; VI-NEXT:    v_add_u16_e32 v8, 3, v43
 ; VI-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v50
+; VI-NEXT:    v_add_u16_e32 v9, 3, v52
 ; VI-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
-; VI-NEXT:    v_add_u16_e32 v10, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v10, 3, v29
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v63
-; VI-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v12, 3, v61
+; VI-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v12, v12, v13
-; VI-NEXT:    v_add_u16_e32 v13, 3, v44
-; VI-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v39
+; VI-NEXT:    v_add_u16_e32 v14, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v16
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v19
-; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v27, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
 ; VI-NEXT:  .LBB25_4: ; %end
@@ -13916,39 +13934,32 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -13958,50 +13969,57 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; GFX9-NEXT:    s_waitcnt vmcnt(6)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; GFX9-NEXT:    s_waitcnt vmcnt(3)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; GFX9-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -14011,29 +14029,29 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr38
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr45
+; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
-; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr54
-; GFX9-NEXT:    ; implicit-def: $vgpr27
-; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr50
+; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -14060,18 +14078,18 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr38
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr30
+; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -14101,23 +14119,23 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -14151,24 +14169,24 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    ; implicit-def: $vgpr52
-; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr49
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr51
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr25
-; GFX9-NEXT:    ; implicit-def: $vgpr21
+; GFX9-NEXT:    ; implicit-def: $vgpr27
 ; GFX9-NEXT:  .LBB25_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB25_4
@@ -14178,28 +14196,28 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v51
-; GFX9-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v40
+; GFX9-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v24
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v30
-; GFX9-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v23
+; GFX9-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v60
-; GFX9-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
-; GFX9-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v58
+; GFX9-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v13, v13, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
-; GFX9-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v45
+; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v53
-; GFX9-NEXT:    v_or_b32_sdwa v15, v23, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
@@ -14236,18 +14254,18 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
-; GFX9-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v17
-; GFX9-NEXT:    v_or_b32_sdwa v16, v57, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v16, v60, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -14285,39 +14303,39 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v43
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v50
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v52
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
-; GFX9-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v29
+; GFX9-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v63
-; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v26
+; GFX9-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v48
-; GFX9-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v61
+; GFX9-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_or_b32_e32 v12, v12, v13
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v19
-; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v27, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GFX9-NEXT:  .LBB25_4: ; %end
@@ -18286,15 +18304,15 @@ define <8 x i64> @bitcast_v32bf16_to_v8i64(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v18, v22, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v18, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v17, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_add_f32 v19, 0x40c00000, v21
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v19, 0x40c00000, v21
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v22, v18, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v16, v19, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v20, 0x400000, v19
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
@@ -20160,109 +20178,109 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v32, v4
 ; GCN-NEXT:    v_mov_b32_e32 v35, v2
 ; GCN-NEXT:    v_mov_b32_e32 v31, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v52
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v1
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v17, 8, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 8, v29
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v45
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 8, v44
-; GCN-NEXT:    s_waitcnt vmcnt(10)
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 24, v59
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v58
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 24, v57
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 24, v43
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v43
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v41
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v10
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v2
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v46, 24, v46
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v1
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v11
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v7
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -20270,13 +20288,13 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v31
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v42
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v40
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v32
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v41
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v55
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v33
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v40
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v54
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v34
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v55
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v53
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v35
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v36
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v37
@@ -20289,132 +20307,133 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v54
-; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v53
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v52
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v49
-; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v48
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v39
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v63
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v62
-; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v61
-; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
+; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v52
+; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v50
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v49
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v39
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v63
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v62
+; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v61
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v23
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v17, v33, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v34
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v33, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v26, v28, v29
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v46
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_or_b32_e32 v27, v28, v45
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v27, v35, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v29, v36, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v31, v38, v45
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v59
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v35
+; GCN-NEXT:    v_or_b32_e32 v30, v36, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_or_b32_e32 v32, v38, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v56
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v32
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v32
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v19, v7
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
+; GCN-NEXT:    v_or_b32_e32 v6, v19, v6
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v19, v5
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v4
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
+; GCN-NEXT:    v_or_b32_e32 v23, v23, v25
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v25, v22
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v26
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v27
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_or_b32_e32 v25, v59, v25
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v57, v30
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v32, v58, v8
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v28
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v30
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v32
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v32, v32, v8
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v9
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v8, v10
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v8, v12
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v8, v14
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v8, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v46, v19
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v8, v17
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v7
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v6
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v5
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v17, v21
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v23
 ; GCN-NEXT:    v_or_b32_e32 v5, v20, v22
-; GCN-NEXT:    v_or_b32_e32 v6, v23, v24
-; GCN-NEXT:    v_or_b32_e32 v7, v26, v28
-; GCN-NEXT:    v_or_b32_e32 v8, v27, v25
+; GCN-NEXT:    v_or_b32_e32 v6, v24, v25
+; GCN-NEXT:    v_or_b32_e32 v7, v26, v27
+; GCN-NEXT:    v_or_b32_e32 v8, v21, v28
 ; GCN-NEXT:    v_or_b32_e32 v9, v29, v30
 ; GCN-NEXT:    v_or_b32_e32 v10, v31, v32
 ; GCN-NEXT:    v_or_b32_e32 v11, v33, v34
 ; GCN-NEXT:    v_or_b32_e32 v12, v35, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v16
-; GCN-NEXT:    v_or_b32_e32 v15, v18, v19
+; GCN-NEXT:    v_or_b32_e32 v15, v18, v17
 ; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -20431,12 +20450,10 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr26
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr21
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr39
@@ -20444,57 +20461,59 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr27
 ; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; kill: killed $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; kill: killed $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; kill: killed $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:  .LBB35_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB35_4
@@ -20502,16 +20521,16 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v42, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v40, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v41, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v55, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v40, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v54, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v55, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v53, v3
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v35
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
@@ -20526,29 +20545,30 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v26
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v28
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v30
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v25
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v54
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v53
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v21
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v51
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v50
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v49
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v48
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v39
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v49
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v48
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v39
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v61
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v19
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v51
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v17
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v8
+; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v8
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
@@ -20559,59 +20579,51 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v4
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
+; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v6, v17, v35
+; GCN-NEXT:    v_or_b32_e32 v6, v56, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
+; GCN-NEXT:    v_or_b32_e32 v9, v47, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
+; GCN-NEXT:    v_or_b32_e32 v11, v46, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v13, v29, v14
+; GCN-NEXT:    v_or_b32_e32 v13, v45, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v15, v27, v16
+; GCN-NEXT:    v_or_b32_e32 v15, v59, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v18
-; GCN-NEXT:    v_or_b32_e32 v17, v44, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v20, v45, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v56, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v30
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v29, v43, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v19, v47, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v18, v58, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v21, v57, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v26, v41, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v25, v43, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_or_b32_e32 v27, v44, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v34
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -20641,29 +20653,37 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s7, v15
-; GCN-NEXT:    v_or_b32_e32 v16, v59, v16
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, s7, v17
-; GCN-NEXT:    v_or_b32_e32 v18, v57, v18
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
-; GCN-NEXT:    v_or_b32_e32 v22, v58, v22
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, s7, v23
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, s7, v18
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, s7, v27
+; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
+; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 0x300, v19
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v30
+; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, s7, v26
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, s7, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x300, v27
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
@@ -20673,13 +20693,13 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_or_b32_e32 v0, v31, v0
 ; GCN-NEXT:    v_or_b32_e32 v1, v8, v1
 ; GCN-NEXT:    v_or_b32_e32 v2, v5, v2
@@ -20689,13 +20709,13 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v6, v12, v11
 ; GCN-NEXT:    v_or_b32_e32 v7, v14, v13
 ; GCN-NEXT:    v_or_b32_e32 v8, v16, v15
-; GCN-NEXT:    v_or_b32_e32 v9, v18, v17
-; GCN-NEXT:    v_or_b32_e32 v10, v22, v20
-; GCN-NEXT:    v_or_b32_e32 v11, v24, v23
-; GCN-NEXT:    v_or_b32_e32 v12, v26, v25
-; GCN-NEXT:    v_or_b32_e32 v13, v28, v27
-; GCN-NEXT:    v_or_b32_e32 v14, v21, v29
-; GCN-NEXT:    v_or_b32_e32 v15, v30, v19
+; GCN-NEXT:    v_or_b32_e32 v9, v20, v18
+; GCN-NEXT:    v_or_b32_e32 v10, v22, v21
+; GCN-NEXT:    v_or_b32_e32 v11, v24, v19
+; GCN-NEXT:    v_or_b32_e32 v12, v28, v26
+; GCN-NEXT:    v_or_b32_e32 v13, v23, v30
+; GCN-NEXT:    v_or_b32_e32 v14, v29, v25
+; GCN-NEXT:    v_or_b32_e32 v15, v17, v27
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -20768,38 +20788,31 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; VI-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; VI-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -20810,38 +20823,52 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; VI-NEXT:    s_waitcnt vmcnt(7)
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; VI-NEXT:    s_waitcnt vmcnt(6)
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; VI-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
+; VI-NEXT:    s_waitcnt vmcnt(9)
+; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
+; VI-NEXT:    s_waitcnt vmcnt(8)
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -20852,28 +20879,28 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr47
-; VI-NEXT:    ; implicit-def: $vgpr42
-; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr46
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr27
-; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr21
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -20900,18 +20927,18 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr17
 ; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr38
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr30
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -20941,23 +20968,23 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -20991,24 +21018,24 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr48
-; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr25
-; VI-NEXT:    ; implicit-def: $vgpr21
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:  .LBB35_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB35_4
@@ -21019,27 +21046,27 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v15, 0x300
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v51
-; VI-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v9, 3, v40
+; VI-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_add_u16_e32 v10, 3, v24
+; VI-NEXT:    v_add_u16_e32 v10, 3, v49
 ; VI-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v11, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v23
+; VI-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v12, 3, v60
-; VI-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_add_u16_e32 v12, 3, v38
+; VI-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
-; VI-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v58
+; VI-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v14, 3, v42
-; VI-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v14, 3, v45
+; VI-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
-; VI-NEXT:    v_or_b32_sdwa v17, v57, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v17, v60, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -21075,18 +21102,18 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
-; VI-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v53
-; VI-NEXT:    v_or_b32_sdwa v16, v23, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -21125,39 +21152,39 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v52
+; VI-NEXT:    v_add_u16_e32 v8, 3, v43
 ; VI-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v50
+; VI-NEXT:    v_add_u16_e32 v9, 3, v52
 ; VI-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
-; VI-NEXT:    v_add_u16_e32 v10, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v10, 3, v29
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v63
-; VI-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v12, 3, v61
+; VI-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v12, v12, v13
-; VI-NEXT:    v_add_u16_e32 v13, 3, v44
-; VI-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v39
+; VI-NEXT:    v_add_u16_e32 v14, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v16
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v19
-; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v27, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
 ; VI-NEXT:  .LBB35_4: ; %end
@@ -21216,39 +21243,32 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -21258,50 +21278,57 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; GFX9-NEXT:    s_waitcnt vmcnt(6)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; GFX9-NEXT:    s_waitcnt vmcnt(3)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; GFX9-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -21311,29 +21338,29 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr38
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr45
+; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
-; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr54
-; GFX9-NEXT:    ; implicit-def: $vgpr27
-; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr50
+; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -21360,18 +21387,18 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr38
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr30
+; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -21401,23 +21428,23 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -21451,24 +21478,24 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    ; implicit-def: $vgpr52
-; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr49
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr51
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr25
-; GFX9-NEXT:    ; implicit-def: $vgpr21
+; GFX9-NEXT:    ; implicit-def: $vgpr27
 ; GFX9-NEXT:  .LBB35_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB35_4
@@ -21478,28 +21505,28 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v51
-; GFX9-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v40
+; GFX9-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v24
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v30
-; GFX9-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v23
+; GFX9-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v60
-; GFX9-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
-; GFX9-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v58
+; GFX9-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v13, v13, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
-; GFX9-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v45
+; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v53
-; GFX9-NEXT:    v_or_b32_sdwa v15, v23, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
@@ -21536,18 +21563,18 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
-; GFX9-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v17
-; GFX9-NEXT:    v_or_b32_sdwa v16, v57, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v16, v60, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -21585,39 +21612,39 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v43
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v50
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v52
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
-; GFX9-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v29
+; GFX9-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v63
-; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v26
+; GFX9-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v48
-; GFX9-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v61
+; GFX9-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_or_b32_e32 v12, v12, v13
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v19
-; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v27, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GFX9-NEXT:  .LBB35_4: ; %end
@@ -25220,15 +25247,15 @@ define <8 x double> @bitcast_v32bf16_to_v8f64(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v18, v22, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v18, 16, v6
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v7
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v8, v17, 0x7060302
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v18, 0x40c00000, v18
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_add_f32 v19, 0x40c00000, v21
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v19, 0x40c00000, v21
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v22, v18, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v16, v19, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v20, 0x400000, v19
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
@@ -27064,109 +27091,109 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_mov_b32_e32 v32, v4
 ; GCN-NEXT:    v_mov_b32_e32 v35, v2
 ; GCN-NEXT:    v_mov_b32_e32 v31, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:72
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v52
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:16
-; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v1
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:12
+; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v1
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v3
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v5
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v7
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 8, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 8, v9
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v11
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v55, 8, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v53, 8, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v15
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v17, 8, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v17
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v19
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v21
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v25
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v25
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v27
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 8, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v29
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v45
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 8, v44
-; GCN-NEXT:    s_waitcnt vmcnt(10)
-; GCN-NEXT:    v_lshlrev_b32_e32 v59, 24, v59
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v58
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 24, v57
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 24, v43
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 8, v14
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v59, 8, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v43
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v42
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v10
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v41
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v14
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v6
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 8, v10
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 8, v6
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v4
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v2
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 8, v2
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:92
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v46, 24, v46
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v1
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:108
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v9
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v11
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v0, 24, v7
+; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -27174,13 +27201,13 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v31
-; GCN-NEXT:    v_or_b32_e32 v0, v0, v42
+; GCN-NEXT:    v_or_b32_e32 v0, v0, v40
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v32
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v41
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v55
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v33
-; GCN-NEXT:    v_or_b32_e32 v2, v2, v40
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v54
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v34
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v55
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v53
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v35
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v36
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v37
@@ -27193,132 +27220,133 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
-; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v54
-; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v53
-; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v52
-; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v49
-; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v48
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v39
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v63
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v62
-; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v61
-; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
+; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v52
+; GCN-NEXT:    v_and_b32_e32 v36, 0xff, v51
+; GCN-NEXT:    v_and_b32_e32 v37, 0xff, v50
+; GCN-NEXT:    v_and_b32_e32 v38, 0xff, v49
+; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v48
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v39
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v63
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v62
+; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v61
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v23
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v17, v33, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v34
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v23
+; GCN-NEXT:    v_or_b32_e32 v23, v33, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
-; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v23, v24, v23
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v26, v28, v29
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v46
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_or_b32_e32 v27, v28, v45
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v27, v35, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v29, v36, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v37
-; GCN-NEXT:    v_or_b32_e32 v31, v38, v45
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v59
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v35
+; GCN-NEXT:    v_or_b32_e32 v30, v36, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v37
+; GCN-NEXT:    v_or_b32_e32 v32, v38, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v56
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v32
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v41
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v32
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v42
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GCN-NEXT:    v_or_b32_e32 v15, v15, v43
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v44
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v7, v32, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v19, v7
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
+; GCN-NEXT:    v_or_b32_e32 v6, v19, v6
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v5, v32, v5
+; GCN-NEXT:    v_or_b32_e32 v5, v19, v5
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v4
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v23
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
+; GCN-NEXT:    v_or_b32_e32 v23, v23, v25
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v22, v25, v22
+; GCN-NEXT:    v_and_b32_e32 v24, 0xffff, v24
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v26
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v27
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_or_b32_e32 v25, v59, v25
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_or_b32_e32 v30, v57, v30
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_or_b32_e32 v32, v58, v8
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v28
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
+; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v30
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
+; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v32
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v32, v32, v8
 ; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v9
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v8, v10
 ; GCN-NEXT:    v_and_b32_e32 v35, 0xffff, v11
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v8, v12
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v8, v14
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v8, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
-; GCN-NEXT:    v_or_b32_e32 v19, v46, v19
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v8, v17
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v7
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v6
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v5
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v17, v21
+; GCN-NEXT:    v_or_b32_e32 v4, v19, v23
 ; GCN-NEXT:    v_or_b32_e32 v5, v20, v22
-; GCN-NEXT:    v_or_b32_e32 v6, v23, v24
-; GCN-NEXT:    v_or_b32_e32 v7, v26, v28
-; GCN-NEXT:    v_or_b32_e32 v8, v27, v25
+; GCN-NEXT:    v_or_b32_e32 v6, v24, v25
+; GCN-NEXT:    v_or_b32_e32 v7, v26, v27
+; GCN-NEXT:    v_or_b32_e32 v8, v21, v28
 ; GCN-NEXT:    v_or_b32_e32 v9, v29, v30
 ; GCN-NEXT:    v_or_b32_e32 v10, v31, v32
 ; GCN-NEXT:    v_or_b32_e32 v11, v33, v34
 ; GCN-NEXT:    v_or_b32_e32 v12, v35, v12
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
 ; GCN-NEXT:    v_or_b32_e32 v14, v15, v16
-; GCN-NEXT:    v_or_b32_e32 v15, v18, v19
+; GCN-NEXT:    v_or_b32_e32 v15, v18, v17
 ; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr32
@@ -27335,12 +27363,10 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr26
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr50
-; GCN-NEXT:    ; implicit-def: $vgpr25
-; GCN-NEXT:    ; implicit-def: $vgpr54
-; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr21
 ; GCN-NEXT:    ; implicit-def: $vgpr52
-; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr51
+; GCN-NEXT:    ; implicit-def: $vgpr50
 ; GCN-NEXT:    ; implicit-def: $vgpr49
 ; GCN-NEXT:    ; implicit-def: $vgpr48
 ; GCN-NEXT:    ; implicit-def: $vgpr39
@@ -27348,57 +27374,59 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr21
-; GCN-NEXT:    ; implicit-def: $vgpr19
-; GCN-NEXT:    ; implicit-def: $vgpr51
-; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr23
+; GCN-NEXT:    ; implicit-def: $vgpr25
+; GCN-NEXT:    ; implicit-def: $vgpr29
+; GCN-NEXT:    ; implicit-def: $vgpr27
 ; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
 ; GCN-NEXT:    ; implicit-def: $vgpr40
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr55
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; kill: killed $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr17
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr29
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; kill: killed $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr27
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr54
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr53
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr56
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
 ; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; kill: killed $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; kill: killed $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr42
 ; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; kill: killed $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; kill: killed $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr19
+; GCN-NEXT:    ; kill: killed $vgpr19
+; GCN-NEXT:    ; implicit-def: $vgpr19
 ; GCN-NEXT:  .LBB43_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB43_4
@@ -27406,16 +27434,16 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 3, v31
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    v_or_b32_e32 v0, v42, v0
+; GCN-NEXT:    v_or_b32_e32 v0, v40, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v32
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v41, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v55, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 3, v33
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v40, v2
+; GCN-NEXT:    v_or_b32_e32 v2, v54, v2
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v34
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v55, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v53, v3
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, 3, v35
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
@@ -27430,29 +27458,30 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v26
 ; GCN-NEXT:    v_add_i32_e32 v14, vcc, 3, v28
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v30
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v50
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v25
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v54
-; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v53
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v21
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 3, v52
+; GCN-NEXT:    v_add_i32_e32 v20, vcc, 3, v51
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v50
+; GCN-NEXT:    v_add_i32_e32 v22, vcc, 3, v49
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v48
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v39
+; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v60
+; GCN-NEXT:    s_waitcnt vmcnt(6)
 ; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v23
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v49
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v48
-; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v39
-; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v61
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v60
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v19
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v51
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v25
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v27
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v17
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
 ; GCN-NEXT:    v_and_b32_e32 v6, 0xff, v6
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v8
+; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v8
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v9
 ; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v10
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v11
@@ -27463,59 +27492,51 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v16
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v20
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v22
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
 ; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v26
 ; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v30
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v31
 ; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v32
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v33
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v21
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v19
-; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v34
-; GCN-NEXT:    v_lshlrev_b32_e32 v36, 16, v4
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v23
+; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v25
+; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v29
+; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v27
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v34, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v7
-; GCN-NEXT:    v_or_b32_e32 v6, v17, v35
+; GCN-NEXT:    v_or_b32_e32 v6, v56, v33
 ; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v9
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
+; GCN-NEXT:    v_or_b32_e32 v9, v47, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v11
-; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
+; GCN-NEXT:    v_or_b32_e32 v11, v46, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v13, v29, v14
+; GCN-NEXT:    v_or_b32_e32 v13, v45, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v15, v27, v16
+; GCN-NEXT:    v_or_b32_e32 v15, v59, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v18
-; GCN-NEXT:    v_or_b32_e32 v17, v44, v20
-; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v20, v45, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v23
-; GCN-NEXT:    v_or_b32_e32 v23, v56, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v26
-; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v25, v25, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v30
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v27, v27, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v29, v43, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v19, v47, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v18, v58, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v21, v57, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v28
+; GCN-NEXT:    v_or_b32_e32 v26, v41, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v31
+; GCN-NEXT:    v_or_b32_e32 v30, v42, v32
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_or_b32_e32 v25, v43, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_or_b32_e32 v27, v44, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, 0x300, v0
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v36
+; GCN-NEXT:    v_or_b32_e32 v31, v31, v34
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s7, v1
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -27545,29 +27566,37 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
 ; GCN-NEXT:    v_add_i32_e32 v15, vcc, s7, v15
-; GCN-NEXT:    v_or_b32_e32 v16, v59, v16
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, s7, v17
-; GCN-NEXT:    v_or_b32_e32 v18, v57, v18
-; GCN-NEXT:    v_add_i32_e32 v20, vcc, s7, v20
-; GCN-NEXT:    v_or_b32_e32 v22, v58, v22
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, s7, v23
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, s7, v18
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, s7, v27
+; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, s7, v21
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
-; GCN-NEXT:    v_add_i32_e32 v29, vcc, s7, v29
+; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, s7, v19
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 0x300, v19
-; GCN-NEXT:    v_or_b32_e32 v30, v46, v30
+; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, s7, v26
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
+; GCN-NEXT:    v_add_i32_e32 v30, vcc, s7, v30
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, s7, v25
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 0x300, v27
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
@@ -27577,13 +27606,13 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
+; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
+; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_and_b32_e32 v26, 0xffff, v26
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
-; GCN-NEXT:    v_and_b32_e32 v29, 0xffff, v29
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_or_b32_e32 v0, v31, v0
 ; GCN-NEXT:    v_or_b32_e32 v1, v8, v1
 ; GCN-NEXT:    v_or_b32_e32 v2, v5, v2
@@ -27593,13 +27622,13 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v6, v12, v11
 ; GCN-NEXT:    v_or_b32_e32 v7, v14, v13
 ; GCN-NEXT:    v_or_b32_e32 v8, v16, v15
-; GCN-NEXT:    v_or_b32_e32 v9, v18, v17
-; GCN-NEXT:    v_or_b32_e32 v10, v22, v20
-; GCN-NEXT:    v_or_b32_e32 v11, v24, v23
-; GCN-NEXT:    v_or_b32_e32 v12, v26, v25
-; GCN-NEXT:    v_or_b32_e32 v13, v28, v27
-; GCN-NEXT:    v_or_b32_e32 v14, v21, v29
-; GCN-NEXT:    v_or_b32_e32 v15, v30, v19
+; GCN-NEXT:    v_or_b32_e32 v9, v20, v18
+; GCN-NEXT:    v_or_b32_e32 v10, v22, v21
+; GCN-NEXT:    v_or_b32_e32 v11, v24, v19
+; GCN-NEXT:    v_or_b32_e32 v12, v28, v26
+; GCN-NEXT:    v_or_b32_e32 v13, v23, v30
+; GCN-NEXT:    v_or_b32_e32 v14, v29, v25
+; GCN-NEXT:    v_or_b32_e32 v15, v17, v27
 ; GCN-NEXT:    v_add_i32_e32 v0, vcc, s6, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, s6, v1
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, s6, v2
@@ -27672,38 +27701,31 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; VI-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; VI-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; VI-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; VI-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -27714,38 +27736,52 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; VI-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; VI-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
+; VI-NEXT:    s_waitcnt vmcnt(9)
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
+; VI-NEXT:    s_waitcnt vmcnt(8)
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; VI-NEXT:    s_waitcnt vmcnt(7)
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; VI-NEXT:    s_waitcnt vmcnt(6)
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; VI-NEXT:    s_waitcnt vmcnt(3)
+; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; VI-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; VI-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; VI-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; VI-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; VI-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -27756,28 +27792,28 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr47
-; VI-NEXT:    ; implicit-def: $vgpr42
-; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr49
+; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr58
+; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    ; implicit-def: $vgpr53
+; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr46
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr40
+; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr27
-; VI-NEXT:    ; implicit-def: $vgpr23
+; VI-NEXT:    ; implicit-def: $vgpr50
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr21
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -27804,18 +27840,18 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr17
 ; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr38
-; VI-NEXT:    ; implicit-def: $vgpr57
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr30
+; VI-NEXT:    ; implicit-def: $vgpr60
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr62
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -27845,23 +27881,23 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr28
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -27895,24 +27931,24 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr48
-; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr47
+; VI-NEXT:    ; implicit-def: $vgpr41
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr63
 ; VI-NEXT:    ; implicit-def: $vgpr59
 ; VI-NEXT:    ; implicit-def: $vgpr56
-; VI-NEXT:    ; implicit-def: $vgpr45
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr44
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr48
 ; VI-NEXT:    ; implicit-def: $vgpr25
-; VI-NEXT:    ; implicit-def: $vgpr21
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:  .LBB43_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB43_4
@@ -27923,27 +27959,27 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    v_mov_b32_e32 v15, 0x300
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_add_u16_e32 v9, 3, v51
-; VI-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v9, 3, v40
+; VI-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v9, v9, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_add_u16_e32 v10, 3, v24
+; VI-NEXT:    v_add_u16_e32 v10, 3, v49
 ; VI-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v10, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v11, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v23
+; VI-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v11, v11, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v12, 3, v60
-; VI-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_add_u16_e32 v12, 3, v38
+; VI-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v12, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v13, 3, v47
-; VI-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v58
+; VI-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v13, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v14, 3, v42
-; VI-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v14, 3, v45
+; VI-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v14, v14, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v17, 3, v17
-; VI-NEXT:    v_or_b32_sdwa v17, v57, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v17, v60, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -27979,18 +28015,18 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_add_u16_sdwa v5, v5, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
 ; VI-NEXT:    v_add_u16_e32 v6, 3, v6
-; VI-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v6, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v53
-; VI-NEXT:    v_or_b32_sdwa v16, v23, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v16, v16, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
 ; VI-NEXT:    v_add_u16_e32 v8, 3, v8
-; VI-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v8, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
 ; VI-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -28029,39 +28065,39 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v7, 3, v7
-; VI-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; VI-NEXT:    v_or_b32_e32 v7, v7, v8
-; VI-NEXT:    v_add_u16_e32 v8, 3, v52
+; VI-NEXT:    v_add_u16_e32 v8, 3, v43
 ; VI-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; VI-NEXT:    v_or_b32_e32 v8, v8, v9
-; VI-NEXT:    v_add_u16_e32 v9, 3, v50
+; VI-NEXT:    v_add_u16_e32 v9, 3, v52
 ; VI-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; VI-NEXT:    v_or_b32_e32 v9, v9, v10
-; VI-NEXT:    v_add_u16_e32 v10, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v10, 3, v29
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; VI-NEXT:    v_or_b32_e32 v10, v10, v11
-; VI-NEXT:    v_add_u16_e32 v11, 3, v63
-; VI-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v11, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; VI-NEXT:    v_or_b32_e32 v11, v11, v12
-; VI-NEXT:    v_add_u16_e32 v12, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v12, 3, v61
+; VI-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; VI-NEXT:    v_or_b32_e32 v12, v12, v13
-; VI-NEXT:    v_add_u16_e32 v13, 3, v44
-; VI-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v13, 3, v47
+; VI-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; VI-NEXT:    v_or_b32_e32 v13, v13, v14
-; VI-NEXT:    v_add_u16_e32 v14, 3, v39
+; VI-NEXT:    v_add_u16_e32 v14, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_or_b32_e32 v14, v14, v16
 ; VI-NEXT:    v_add_u16_e32 v16, 3, v19
-; VI-NEXT:    v_or_b32_sdwa v16, v21, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v16, v27, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
 ; VI-NEXT:    v_or_b32_e32 v15, v16, v15
 ; VI-NEXT:  .LBB43_4: ; %end
@@ -28120,39 +28156,32 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:80
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:88
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:96
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v57, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v50, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:88
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:96
+; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:104
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v21
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v23
+; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v23
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v25
-; GFX9-NEXT:    v_lshlrev_b16_e32 v38, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v29
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v27
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v29
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v17
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v19
-; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
 ; GFX9-NEXT:    buffer_load_ushort v17, off, s[0:3], s32 offset:124
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v1
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v32, 8, v3
@@ -28162,50 +28191,57 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v11
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v37, 8, v13
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v16, 8, v15
-; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v54
-; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v2
-; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v4
-; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v8
-; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v12
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v24
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v30
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v53
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v42
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v44
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v47
+; GFX9-NEXT:    v_lshlrev_b16_e32 v42, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v57
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v60
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v38
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v63
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v39
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v48
+; GFX9-NEXT:    s_waitcnt vmcnt(6)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v49
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v52
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v53
+; GFX9-NEXT:    s_waitcnt vmcnt(3)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v40
+; GFX9-NEXT:    buffer_load_ushort v23, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v40, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v19, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v41
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v45
 ; GFX9-NEXT:    buffer_load_ushort v53, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v60, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v45, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v58, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
@@ -28215,29 +28251,29 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v51, v58 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v24, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v30, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v60, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v42, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v40, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v46 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v23, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v38, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v58, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v45, v39 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v53, v21 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr38
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
+; GFX9-NEXT:    ; implicit-def: $vgpr45
+; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
-; GFX9-NEXT:    ; implicit-def: $vgpr43
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr42
 ; GFX9-NEXT:    ; implicit-def: $vgpr54
-; GFX9-NEXT:    ; implicit-def: $vgpr27
-; GFX9-NEXT:    ; implicit-def: $vgpr23
+; GFX9-NEXT:    ; implicit-def: $vgpr50
+; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
@@ -28264,18 +28300,18 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v20 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v26 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v30 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v17, v60 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr38
-; GFX9-NEXT:    ; implicit-def: $vgpr57
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr30
+; GFX9-NEXT:    ; implicit-def: $vgpr60
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr62
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v35 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -28305,23 +28341,23 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr28
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v8, v52, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v43, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v9, v50, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v10, v49, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v29, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v26, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v48, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v61, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v44, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v47, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v41, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v19, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -28355,24 +28391,24 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    ; implicit-def: $vgpr52
-; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr49
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr39
+; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr47
+; GFX9-NEXT:    ; implicit-def: $vgpr41
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
-; GFX9-NEXT:    ; implicit-def: $vgpr45
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr51
+; GFX9-NEXT:    ; implicit-def: $vgpr48
 ; GFX9-NEXT:    ; implicit-def: $vgpr25
-; GFX9-NEXT:    ; implicit-def: $vgpr21
+; GFX9-NEXT:    ; implicit-def: $vgpr27
 ; GFX9-NEXT:  .LBB43_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB43_4
@@ -28382,28 +28418,28 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_movk_i32 s6, 0x300
-; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v51
-; GFX9-NEXT:    v_or_b32_sdwa v9, v58, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(14)
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v40
+; GFX9-NEXT:    v_or_b32_sdwa v9, v57, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v9, v9, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v24
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
 ; GFX9-NEXT:    v_or_b32_sdwa v10, v46, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v10, v10, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v30
-; GFX9-NEXT:    v_or_b32_sdwa v11, v43, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v23
+; GFX9-NEXT:    v_or_b32_sdwa v11, v42, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v11, v11, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v60
-; GFX9-NEXT:    v_or_b32_sdwa v12, v40, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v12, v54, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v12, v12, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
-; GFX9-NEXT:    v_or_b32_sdwa v13, v54, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v58
+; GFX9-NEXT:    v_or_b32_sdwa v13, v50, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v13, v13, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v42
-; GFX9-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v45
+; GFX9-NEXT:    v_or_b32_sdwa v14, v39, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v14, v14, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v53
-; GFX9-NEXT:    v_or_b32_sdwa v15, v23, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v15, v15, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
@@ -28440,18 +28476,18 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_add_u16_sdwa v5, v5, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_add_u16_e32 v6, 3, v6
-; GFX9-NEXT:    v_or_b32_sdwa v6, v26, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v6, v24, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v6, v6, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v38, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v30, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v7, v7, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    v_add_u16_e32 v16, 3, v17
-; GFX9-NEXT:    v_or_b32_sdwa v16, v57, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v16, v60, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v16, v16, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v8, 3, v8
-; GFX9-NEXT:    v_or_b32_sdwa v8, v61, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v8, v62, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_sdwa v8, v8, s6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v2, 3, v2
@@ -28489,39 +28525,39 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v7, 3, v7
-; GFX9-NEXT:    v_or_b32_sdwa v7, v62, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v7, v63, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v7
 ; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
-; GFX9-NEXT:    v_add_u16_e32 v8, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v8, 3, v43
 ; GFX9-NEXT:    v_or_b32_sdwa v8, v59, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v8
 ; GFX9-NEXT:    v_or_b32_e32 v8, v8, v9
-; GFX9-NEXT:    v_add_u16_e32 v9, 3, v50
+; GFX9-NEXT:    v_add_u16_e32 v9, 3, v52
 ; GFX9-NEXT:    v_or_b32_sdwa v9, v56, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v9
 ; GFX9-NEXT:    v_or_b32_e32 v9, v9, v10
-; GFX9-NEXT:    v_add_u16_e32 v10, 3, v49
-; GFX9-NEXT:    v_or_b32_sdwa v10, v45, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v10, 3, v29
+; GFX9-NEXT:    v_or_b32_sdwa v10, v44, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v10
 ; GFX9-NEXT:    v_or_b32_e32 v10, v10, v11
-; GFX9-NEXT:    v_add_u16_e32 v11, 3, v63
-; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v11, 3, v26
+; GFX9-NEXT:    v_or_b32_sdwa v11, v55, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v11
 ; GFX9-NEXT:    v_or_b32_e32 v11, v11, v12
-; GFX9-NEXT:    v_add_u16_e32 v12, 3, v48
-; GFX9-NEXT:    v_or_b32_sdwa v12, v55, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v12, 3, v61
+; GFX9-NEXT:    v_or_b32_sdwa v12, v51, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v12
 ; GFX9-NEXT:    v_or_b32_e32 v12, v12, v13
-; GFX9-NEXT:    v_add_u16_e32 v13, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v13, 3, v47
+; GFX9-NEXT:    v_or_b32_sdwa v13, v48, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v13
 ; GFX9-NEXT:    v_or_b32_e32 v13, v13, v14
-; GFX9-NEXT:    v_add_u16_e32 v14, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v14, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v14, v25, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; GFX9-NEXT:    v_or_b32_e32 v14, v14, v15
 ; GFX9-NEXT:    v_add_u16_e32 v15, 3, v19
-; GFX9-NEXT:    v_or_b32_sdwa v15, v21, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v15, v27, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v15
 ; GFX9-NEXT:    v_or_b32_e32 v15, v15, v16
 ; GFX9-NEXT:  .LBB43_4: ; %end
@@ -31721,44 +31757,44 @@ define <32 x i16> @bitcast_v32bf16_to_v32i16(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v30, 16, v11
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v24, 0x40c00000, v24 :: v_dual_lshlrev_b32 v25, 16, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v21, v17, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v19, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v22, 0x400000, v16
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v25, 0x40c00000, v25
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v21, v21, v17, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v19, v16, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v26, 0x40c00000, v26 :: v_dual_lshlrev_b32 v27, 16, v8
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v16, v19, v22, vcc_lo
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v25, 0x40c00000, v25 :: v_dual_add_f32 v6, 0x40c00000, v6
+; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v16, v19, v22 :: v_dual_lshlrev_b32 v27, 16, v8
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v19, 0x400000, v17
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v1, 0x40c00000, v1 :: v_dual_lshlrev_b32 v22, 16, v3
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v26, 0x40c00000, v26 :: v_dual_add_f32 v27, 0x40c00000, v27
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v0, 0x40c00000, v0
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v27, 0x40c00000, v27
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
-; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v29, 16, v10
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v28, 0x40c00000, v28 :: v_dual_lshlrev_b32 v29, 16, v10
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v20, v0, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v23, 0x400000, v0
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v28, 0x40c00000, v28 :: v_dual_add_f32 v29, 0x40c00000, v29
-; GFX11-FAKE16-NEXT:    v_add3_u32 v20, v20, v0, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX11-FAKE16-NEXT:    v_add3_u32 v20, v20, v0, 0x7fff
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v29, 0x40c00000, v29
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v30, 0x40c00000, v30 :: v_dual_lshlrev_b32 v31, 16, v12
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v0, v20, v23 :: v_dual_lshlrev_b32 v23, 16, v4
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v20, v1, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v10, 0x40c00000, v10 :: v_dual_add_f32 v23, 0x40c00000, v23
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v0, v0, v16, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v21, v19, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v20, v1, 0x7fff
@@ -31776,12 +31812,12 @@ define <32 x i16> @bitcast_v32bf16_to_v32i16(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v21, v18, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v20, 0x400000, v18
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v33, 0x400000, v31
+; GFX11-FAKE16-NEXT:    v_bfe_u32 v34, v12, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v21, v18, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v21, v2, 16, 1
-; GFX11-FAKE16-NEXT:    v_bfe_u32 v34, v12, 16, 1
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v1, v17, 0x7060302
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v18, v19, v20 :: v_dual_and_b32 v7, 0xffff0000, v7
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v21, v2, 0x7fff
@@ -34221,41 +34257,25 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:64
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:120
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:116
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:112
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:108
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:104
-; GCN-NEXT:    s_waitcnt vmcnt(5)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v39
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:40
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v5, 24, v7
 ; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
@@ -34263,7 +34283,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v13
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v15
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
@@ -34272,7 +34292,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v21
-; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v23
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
@@ -34297,52 +34317,70 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v25
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v38
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:120
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:116
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:104
+; GCN-NEXT:    s_waitcnt vmcnt(6)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v51
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v34
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v39
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v33
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v35
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v36
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v35
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v32
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v43, 8, v33
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v16
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:68
+; GCN-NEXT:    s_waitcnt vmcnt(5)
+; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v3
+; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:92
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v34
+; GCN-NEXT:    s_waitcnt vmcnt(5) expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v7
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v8
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v5
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v49
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 24, v9
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v53
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v36
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:100
 ; GCN-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:124
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v32
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v31
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt vmcnt(10)
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v11
 ; GCN-NEXT:    ; implicit-def: $vgpr32
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr33
@@ -34381,15 +34419,15 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v4
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v12
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v20
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v5
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v28
@@ -34404,33 +34442,33 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v18
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v30
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v26
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v56
-; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v46
-; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v47
-; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v62
-; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v0
-; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v61
-; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v41
-; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v63
-; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v40
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v42
+; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v41
+; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v43
+; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v40
+; GCN-NEXT:    v_and_b32_e32 v26, 0xff, v62
+; GCN-NEXT:    v_and_b32_e32 v27, 0xff, v63
+; GCN-NEXT:    v_and_b32_e32 v28, 0xff, v59
+; GCN-NEXT:    v_and_b32_e32 v29, 0xff, v58
+; GCN-NEXT:    v_and_b32_e32 v30, 0xff, v57
 ; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v24
 ; GCN-NEXT:    v_and_b32_e32 v31, 0xff, v16
-; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v58
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v32, 0xff, v56
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:324 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v33, 0xff, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:320 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v34, 0xff, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:316 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v0
-; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v35, 0xff, v10
+; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v0
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v59
-; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v60
-; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v42
+; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v10
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v61
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v0
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v60
 ; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v8
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v2
@@ -34450,10 +34488,10 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_or_b32_e32 v20, v25, v20
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v23, v28, v43
+; GCN-NEXT:    v_or_b32_e32 v23, v28, v44
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v29
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v30
-; GCN-NEXT:    v_or_b32_e32 v24, v24, v57
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v47
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v31
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v32
 ; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
@@ -34471,11 +34509,11 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v13, v32
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v32
-; GCN-NEXT:    v_or_b32_e32 v15, v15, v44
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v45
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v45
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v46
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -34592,64 +34630,26 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr26
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr30
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr47
-; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr46
-; GCN-NEXT:    ; implicit-def: $vgpr60
 ; GCN-NEXT:    ; implicit-def: $vgpr61
-; GCN-NEXT:    ; implicit-def: $vgpr62
-; GCN-NEXT:    ; implicit-def: $vgpr0
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
-; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr0
 ; GCN-NEXT:    ; implicit-def: $vgpr63
-; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr40
+; GCN-NEXT:    ; implicit-def: $vgpr62
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr59
 ; GCN-NEXT:    ; implicit-def: $vgpr58
+; GCN-NEXT:    ; implicit-def: $vgpr8
+; GCN-NEXT:    ; implicit-def: $vgpr56
 ; GCN-NEXT:    ; implicit-def: $vgpr24
 ; GCN-NEXT:    ; implicit-def: $vgpr16
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; kill: killed $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; kill: killed $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; kill: killed $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr44
@@ -34667,7 +34667,45 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr44
 ; GCN-NEXT:    ; kill: killed $vgpr44
 ; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; kill: killed $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; kill: killed $vgpr45
 ; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:  .LBB49_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB49_4
@@ -34675,31 +34713,31 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(2) expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v8
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v45, v1
+; GCN-NEXT:    v_or_b32_e32 v1, v46, v1
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v24
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v57, v3
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v42
+; GCN-NEXT:    v_or_b32_e32 v3, v47, v3
+; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v60
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
-; GCN-NEXT:    v_or_b32_e32 v5, v44, v5
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v41
+; GCN-NEXT:    v_or_b32_e32 v5, v45, v5
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v59
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v43, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v44, v7
 ; GCN-NEXT:    s_movk_i32 s7, 0x300
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v58
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v56
 ; GCN-NEXT:    s_mov_b32 s6, 0x3000000
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v16
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v40
-; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v60
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v61
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v0
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v59
-; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v47
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v56
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v46
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v57
+; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v58
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 3, v0
+; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v40
+; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v16, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v42
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v41
 ; GCN-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v0
@@ -34756,7 +34794,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v2
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v4
 ; GCN-NEXT:    v_lshlrev_b32_e32 v0, 16, v6
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v4, v32
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v9
@@ -34784,7 +34822,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v26, v20
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
@@ -34792,7 +34830,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v26, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v27, v12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
@@ -34800,7 +34838,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v27, v28
 ; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v29
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v29, v30
 ; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v31
@@ -35006,78 +35044,78 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:88
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:96
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
 ; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:4
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:116
-; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:124
-; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v7
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v9
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v11
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v13
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v15
+; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v29
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v27, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:116
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v1
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
+; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
+; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v13
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v15
 ; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v17
 ; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
 ; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v16
-; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v18
-; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v20
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v10
+; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v12
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v14
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v16
+; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v18
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v20
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v22
+; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v22
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v26
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v24
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v28
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v26
+; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:124
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v30
+; VI-NEXT:    v_lshlrev_b16_e32 v36, 8, v30
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v34, 8, v31
+; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v31
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v34, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:84
 ; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:68
 ; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:52
+; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v28
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v32
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
@@ -35089,79 +35127,80 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; VI-NEXT:    v_or_b32_sdwa v9, v39, v47 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v20, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v28, v59 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v22, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v30, v16 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v35, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v31, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v38, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr39
-; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr28
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v29, v47 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v41, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v46, v59 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v22, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v30, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v34, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v31, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v16, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr22
 ; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr34
 ; VI-NEXT:    ; implicit-def: $vgpr31
-; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr59
-; VI-NEXT:    ; implicit-def: $vgpr62
-; VI-NEXT:    ; implicit-def: $vgpr18
-; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr36
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_or_b32_sdwa v0, v0, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v0, v0, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v51 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v53 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    ; implicit-def: $vgpr49
 ; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr23
-; VI-NEXT:    ; implicit-def: $vgpr27
+; VI-NEXT:    ; implicit-def: $vgpr40
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v45 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v52 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v4, v4, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr52
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr40
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -35178,23 +35217,23 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr25
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v42 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v49, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v27, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v48, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v55, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v55, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v36, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v18, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v43, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v26, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v37, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v37, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v44, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v35, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v61, v34 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -35228,98 +35267,92 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr36
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr37
 ; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr29
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr18
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr37
+; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr42
+; VI-NEXT:    ; implicit-def: $vgpr45
 ; VI-NEXT:    ; implicit-def: $vgpr56
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr34
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr20
+; VI-NEXT:    ; implicit-def: $vgpr28
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:  .LBB49_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB49_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v38
-; VI-NEXT:    v_add_u16_e32 v2, 3, v44
+; VI-NEXT:    v_add_u16_e32 v0, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v0, v32, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v14, v26, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_add_u16_e32 v2, 3, v35
 ; VI-NEXT:    v_mov_b32_e32 v3, 0x300
-; VI-NEXT:    v_or_b32_sdwa v2, v18, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v18, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v37
-; VI-NEXT:    v_or_b32_sdwa v24, v24, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v16, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_add_u16_e32 v0, 3, v37
+; VI-NEXT:    v_or_b32_sdwa v20, v20, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v0, v16, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v38, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v43
-; VI-NEXT:    v_or_b32_sdwa v16, v63, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v22
-; VI-NEXT:    v_or_b32_sdwa v0, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v11, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v36
-; VI-NEXT:    v_or_b32_sdwa v22, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v26, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v28
+; VI-NEXT:    v_add_u16_e32 v0, 3, v22
+; VI-NEXT:    v_or_b32_sdwa v0, v61, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v11, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_add_u16_e32 v0, 3, v18
+; VI-NEXT:    v_or_b32_sdwa v18, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v0, v59, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v55
-; VI-NEXT:    v_or_b32_sdwa v28, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v20
+; VI-NEXT:    v_add_u16_e32 v0, 3, v44
+; VI-NEXT:    v_or_b32_sdwa v22, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v0, v57, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v2, 3, v35
 ; VI-NEXT:    v_add_u16_sdwa v9, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v20, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v0, 3, v39
+; VI-NEXT:    v_add_u16_e32 v0, 3, v55
+; VI-NEXT:    v_or_b32_sdwa v14, v28, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v28, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v0, v47, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v30, v46, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v27
+; VI-NEXT:    v_or_b32_sdwa v27, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v2, 3, v34
+; VI-NEXT:    v_or_b32_sdwa v2, v24, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v1, 3, v61
-; VI-NEXT:    v_or_b32_sdwa v15, v34, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v63
+; VI-NEXT:    v_or_b32_sdwa v15, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v1, 3, v31
-; VI-NEXT:    v_or_b32_sdwa v1, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v26, v1, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
-; VI-NEXT:    v_or_b32_e32 v12, v16, v12
-; VI-NEXT:    v_add_u16_e32 v16, 0x300, v24
+; VI-NEXT:    v_or_b32_sdwa v1, v36, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v24, v1, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
-; VI-NEXT:    v_or_b32_e32 v13, v16, v13
-; VI-NEXT:    v_or_b32_e32 v14, v14, v26
-; VI-NEXT:    v_or_b32_e32 v15, v15, v18
+; VI-NEXT:    v_or_b32_e32 v14, v14, v24
+; VI-NEXT:    v_or_b32_e32 v15, v15, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v29, v29, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v29, v42, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v27, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -35352,54 +35385,58 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v6, v17, v6
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v29
 ; VI-NEXT:    v_or_b32_e32 v7, v17, v7
-; VI-NEXT:    v_add_u16_e32 v17, 0x300, v30
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v27
 ; VI-NEXT:    v_or_b32_e32 v8, v17, v8
-; VI-NEXT:    v_add_u16_e32 v17, 0x300, v20
-; VI-NEXT:    v_or_b32_e32 v9, v17, v9
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v28
-; VI-NEXT:    v_or_b32_e32 v10, v17, v10
+; VI-NEXT:    v_or_b32_e32 v9, v17, v9
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v22
+; VI-NEXT:    v_or_b32_e32 v10, v17, v10
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v18
 ; VI-NEXT:    v_or_b32_e32 v11, v17, v11
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v26
+; VI-NEXT:    v_or_b32_e32 v12, v17, v12
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v20
+; VI-NEXT:    v_or_b32_e32 v13, v17, v13
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v42, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v23, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v23, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v2, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v27, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v30, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v1, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v31, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v31, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v48, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v0, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v3, v50, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v39, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v3, v0
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v31
 ; VI-NEXT:    v_or_b32_e32 v1, v3, v1
-; VI-NEXT:    v_add_u16_e32 v3, 0x300, v27
+; VI-NEXT:    v_add_u16_e32 v3, 0x300, v30
 ; VI-NEXT:    v_or_b32_e32 v2, v3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v23
 ; VI-NEXT:    v_or_b32_e32 v3, v3, v19
@@ -35459,99 +35496,100 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:80
 ; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:88
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:96
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v33, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v7
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v9
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v11
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v13
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v15
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v17
+; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v17
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v19
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v23
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v25
+; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
-; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v29
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v27, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v25, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v1
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v9
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v11
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v13
+; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v15
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
+; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v4
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v10
+; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v12
+; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v16
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v18
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v20
+; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v24
-; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v28
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v31
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v32
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v34, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v33
+; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v31
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v32
+; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB49_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
@@ -35559,108 +35597,107 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v39, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v35, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v34, v36 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v16, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v32, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v38, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr39
-; GFX9-NEXT:    ; implicit-def: $vgpr62
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr35
-; GFX9-NEXT:    ; implicit-def: $vgpr34
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr38
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v25, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v27, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(10)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v26, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v37, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v20, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v31, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr25
+; GFX9-NEXT:    ; implicit-def: $vgpr27
+; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr37
+; GFX9-NEXT:    ; implicit-def: $vgpr20
+; GFX9-NEXT:    ; implicit-def: $vgpr31
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr36
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr18
+; GFX9-NEXT:    ; implicit-def: $vgpr30
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s6
 ; GFX9-NEXT:    v_perm_b32 v1, v3, v2, s6
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr49
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr39
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr53
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr23
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr27
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v2, v3, v2, s6
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    ; implicit-def: $vgpr52
+; GFX9-NEXT:    ; implicit-def: $vgpr51
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v3, v4, v3, s6
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr43
+; GFX9-NEXT:    ; implicit-def: $vgpr55
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v5, v6, v5, s6
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v6, v7, v6, s6
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr25
+; GFX9-NEXT:    ; implicit-def: $vgpr40
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v7, v8, v7, s6
-; GFX9-NEXT:    v_or_b32_sdwa v8, v51, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v29, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
-; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v57 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v42, v57 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
-; GFX9-NEXT:    v_or_b32_sdwa v10, v42, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v47, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    v_or_b32_sdwa v11, v37, v61 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v35, v61 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    v_or_b32_sdwa v12, v44, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v16, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
-; GFX9-NEXT:    v_or_b32_sdwa v13, v28, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v24, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    v_or_b32_sdwa v14, v47, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v34 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    v_or_b32_sdwa v15, v18, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v38, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v36, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -35694,111 +35731,110 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr52
+; GFX9-NEXT:    ; implicit-def: $vgpr29
 ; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr37
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr28
-; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr18
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; implicit-def: $vgpr38
+; GFX9-NEXT:    ; implicit-def: $vgpr36
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr22
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr31
+; GFX9-NEXT:    ; implicit-def: $vgpr34
+; GFX9-NEXT:    ; implicit-def: $vgpr28
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:  .LBB49_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB49_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
-; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v18
-; GFX9-NEXT:    v_or_b32_sdwa v0, v31, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v0, v28, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v28
-; GFX9-NEXT:    v_or_b32_sdwa v0, v30, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v24
+; GFX9-NEXT:    v_or_b32_sdwa v0, v22, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v20
+; GFX9-NEXT:    v_or_b32_sdwa v0, v18, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v0
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v16
-; GFX9-NEXT:    v_or_b32_sdwa v0, v26, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v0, v22, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v63, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v34
-; GFX9-NEXT:    v_or_b32_sdwa v0, v36, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v37
+; GFX9-NEXT:    v_or_b32_sdwa v0, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v35
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v61, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v35
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v26
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v24, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v42
+; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v47
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v59, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v42
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v57, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v51
+; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v46, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_add_u16_e32 v3, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v3, v20, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v3
+; GFX9-NEXT:    v_add_u16_e32 v3, 3, v31
+; GFX9-NEXT:    v_or_b32_sdwa v3, v30, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v3
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v32
+; GFX9-NEXT:    v_or_b32_sdwa v2, v34, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v2
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v38
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v36
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v1
+; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v1
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
-; GFX9-NEXT:    v_perm_b32 v8, v30, v8, s6
-; GFX9-NEXT:    v_perm_b32 v9, v28, v9, s6
-; GFX9-NEXT:    v_perm_b32 v10, v26, v10, s6
-; GFX9-NEXT:    v_perm_b32 v11, v24, v11, s6
-; GFX9-NEXT:    v_perm_b32 v12, v22, v12, s6
-; GFX9-NEXT:    v_perm_b32 v13, v16, v13, s6
-; GFX9-NEXT:    v_perm_b32 v14, v20, v14, s6
-; GFX9-NEXT:    v_perm_b32 v15, v18, v15, s6
+; GFX9-NEXT:    v_perm_b32 v8, v25, v8, s6
+; GFX9-NEXT:    v_perm_b32 v9, v24, v9, s6
+; GFX9-NEXT:    v_perm_b32 v10, v22, v10, s6
+; GFX9-NEXT:    v_perm_b32 v11, v20, v11, s6
+; GFX9-NEXT:    v_perm_b32 v12, v16, v12, s6
+; GFX9-NEXT:    v_perm_b32 v13, v18, v13, s6
+; GFX9-NEXT:    v_perm_b32 v14, v30, v14, s6
+; GFX9-NEXT:    v_perm_b32 v15, v28, v15, s6
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v29, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v44, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
-; GFX9-NEXT:    v_or_b32_sdwa v31, v48, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v31, v39, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v27, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v27, 0x300, v0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_perm_b32 v7, v27, v7, s6
+; GFX9-NEXT:    v_perm_b32 v7, v26, v7, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v25, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v6, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -35820,7 +35856,7 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v5, v19, v5, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v55, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -35831,40 +35867,40 @@ define <32 x i16> @bitcast_v64i8_to_v32i16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v4, v17, v4, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v3, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v3, v21, v3, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v55, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v27, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_perm_b32 v2, v25, v2, s6
+; GFX9-NEXT:    v_perm_b32 v2, v27, v2, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v1, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v29, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v1, v29, v1, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v48, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_perm_b32 v0, v31, v0, s6
 ; GFX9-NEXT:  .LBB49_4: ; %end
@@ -38344,44 +38380,44 @@ define <32 x half> @bitcast_v32bf16_to_v32f16(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v30, 16, v11
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v24, 0x40c00000, v24 :: v_dual_lshlrev_b32 v25, 16, v6
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v21, v17, 16, 1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v19, v16, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v22, 0x400000, v16
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v25, 0x40c00000, v25
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v16, v16
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v21, v21, v17, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v19, v16, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v26, 0x40c00000, v26 :: v_dual_lshlrev_b32 v27, 16, v8
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v16, v19, v22, vcc_lo
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v25, 0x40c00000, v25 :: v_dual_add_f32 v6, 0x40c00000, v6
+; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v16, v19, v22 :: v_dual_lshlrev_b32 v27, 16, v8
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v19, 0x400000, v17
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v1, 0x40c00000, v1 :: v_dual_lshlrev_b32 v22, 16, v3
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v6, 0x40c00000, v6
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v26, 0x40c00000, v26 :: v_dual_add_f32 v27, 0x40c00000, v27
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v0, 0x40c00000, v0
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v27, 0x40c00000, v27
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v3, 0x40c00000, v3
-; GFX11-FAKE16-NEXT:    v_lshlrev_b32_e32 v29, 16, v10
+; GFX11-FAKE16-NEXT:    v_dual_add_f32 v28, 0x40c00000, v28 :: v_dual_lshlrev_b32 v29, 16, v10
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v20, v0, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v23, 0x400000, v0
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-FAKE16-NEXT:    v_add_f32_e32 v8, 0x40c00000, v8
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v28, 0x40c00000, v28 :: v_dual_add_f32 v29, 0x40c00000, v29
-; GFX11-FAKE16-NEXT:    v_add3_u32 v20, v20, v0, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX11-FAKE16-NEXT:    v_add3_u32 v20, v20, v0, 0x7fff
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v29, 0x40c00000, v29
 ; GFX11-FAKE16-NEXT:    v_dual_add_f32 v30, 0x40c00000, v30 :: v_dual_lshlrev_b32 v31, 16, v12
-; GFX11-FAKE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v10, 0x40c00000, v10
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v0, v20, v23 :: v_dual_lshlrev_b32 v23, 16, v4
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v20, v1, 16, 1
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX11-FAKE16-NEXT:    v_dual_add_f32 v10, 0x40c00000, v10 :: v_dual_add_f32 v23, 0x40c00000, v23
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v23, 0x40c00000, v23
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v0, v0, v16, 0x7060302
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v17, v21, v19, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v20, v1, 0x7fff
@@ -38399,12 +38435,12 @@ define <32 x half> @bitcast_v32bf16_to_v32f16(<32 x bfloat> %a, i32 %b) {
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v21, v18, 16, 1
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v20, 0x400000, v18
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX11-FAKE16-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v33, 0x400000, v31
+; GFX11-FAKE16-NEXT:    v_bfe_u32 v34, v12, 16, 1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v21, v18, 0x7fff
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v21, v2, 16, 1
-; GFX11-FAKE16-NEXT:    v_bfe_u32 v34, v12, 16, 1
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v1, v17, 0x7060302
+; GFX11-FAKE16-NEXT:    v_add_f32_e32 v5, 0x40c00000, v5
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_dual_cndmask_b32 v18, v19, v20 :: v_dual_and_b32 v7, 0xffff0000, v7
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v19, v21, v2, 0x7fff
@@ -40676,72 +40712,61 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:316 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v26, off, s[0:3], s32 offset:320 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v24, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:112
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:104
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:96
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:48
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:132
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:128
-; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:124
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:120
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v39
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:88
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:84
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:76
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:12
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v3
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v5
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v9
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v11
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v13
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v15
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:228 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v17
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:232 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v19
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v21
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:240 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v23
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
@@ -40755,52 +40780,63 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v29
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:132
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
+; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:124
+; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:120
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v38
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v37
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:220 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt vmcnt(14) expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v52
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v51
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:236 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v50
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v39
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v36
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:260 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v35
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:264 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v34
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:276 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v33
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:280 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v32
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:284 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v31
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v30
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:288 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:300 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    v_lshlrev_b32_e32 v1, 8, v28
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:292 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 8, v26
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v60, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:304 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v59, off, s[0:3], s32 offset:92
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 8, v26
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:308 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v1
 ; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:116
 ; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:108
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 8, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v46, 8, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v47, 8, v9
+; GCN-NEXT:    v_lshlrev_b32_e32 v57, 8, v11
+; GCN-NEXT:    v_lshlrev_b32_e32 v58, 8, v7
 ; GCN-NEXT:    ; implicit-def: $vgpr55
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr39
@@ -40838,20 +40874,20 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    s_cbranch_execz .LBB53_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.false
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v0, v0, v1
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v2
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v4
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v6
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
 ; GCN-NEXT:    v_and_b32_e32 v25, 0xff, v8
@@ -40874,41 +40910,41 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v8, 0xff, v8
-; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v43
-; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v60
-; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v58
-; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v57
-; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v56
-; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v30
-; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v42
-; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v40
-; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v63
-; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v62
-; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v61
-; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v24
-; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v41
+; GCN-NEXT:    v_and_b32_e32 v9, 0xff, v56
+; GCN-NEXT:    v_and_b32_e32 v10, 0xff, v45
+; GCN-NEXT:    v_and_b32_e32 v11, 0xff, v44
+; GCN-NEXT:    v_and_b32_e32 v12, 0xff, v43
+; GCN-NEXT:    v_and_b32_e32 v13, 0xff, v42
+; GCN-NEXT:    v_and_b32_e32 v14, 0xff, v41
+; GCN-NEXT:    v_and_b32_e32 v15, 0xff, v40
+; GCN-NEXT:    v_and_b32_e32 v16, 0xff, v63
+; GCN-NEXT:    v_and_b32_e32 v17, 0xff, v62
+; GCN-NEXT:    v_and_b32_e32 v18, 0xff, v61
+; GCN-NEXT:    v_and_b32_e32 v19, 0xff, v24
+; GCN-NEXT:    v_and_b32_e32 v20, 0xff, v59
+; GCN-NEXT:    v_and_b32_e32 v21, 0xff, v30
 ; GCN-NEXT:    v_and_b32_e32 v22, 0xff, v28
 ; GCN-NEXT:    v_and_b32_e32 v23, 0xff, v26
-; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v59
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v24, 0xff, v60
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v25, v26
-; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v27, v26
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v29, v27
-; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v31, v28
-; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v29, v32, v29
-; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v30, v33, v30
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v31, v34, v31
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
@@ -40923,49 +40959,49 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v34, v7, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v8, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v36, v9, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v10, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v38, v11, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v12, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v48, v13, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v14, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v50, v15, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v16, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v52, v17, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v54, v19, v5
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v5
-; GCN-NEXT:    v_or_b32_e32 v40, v21, v44
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v45
-; GCN-NEXT:    v_or_b32_e32 v41, v23, v46
-; GCN-NEXT:    v_or_b32_e32 v24, v24, v47
+; GCN-NEXT:    v_or_b32_e32 v40, v21, v46
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v47
+; GCN-NEXT:    v_or_b32_e32 v41, v23, v57
+; GCN-NEXT:    v_or_b32_e32 v24, v24, v58
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v0
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v2
@@ -41018,114 +41054,113 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    ; kill: killed $vgpr24
 ; GCN-NEXT:    ; implicit-def: $vgpr24
 ; GCN-NEXT:    ; kill: killed $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr43
-; GCN-NEXT:    ; implicit-def: $vgpr60
-; GCN-NEXT:    ; implicit-def: $vgpr58
-; GCN-NEXT:    ; implicit-def: $vgpr57
 ; GCN-NEXT:    ; implicit-def: $vgpr56
-; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr44
+; GCN-NEXT:    ; implicit-def: $vgpr43
 ; GCN-NEXT:    ; implicit-def: $vgpr42
+; GCN-NEXT:    ; implicit-def: $vgpr41
 ; GCN-NEXT:    ; implicit-def: $vgpr40
 ; GCN-NEXT:    ; implicit-def: $vgpr63
 ; GCN-NEXT:    ; implicit-def: $vgpr62
 ; GCN-NEXT:    ; implicit-def: $vgpr61
 ; GCN-NEXT:    ; implicit-def: $vgpr24
-; GCN-NEXT:    ; implicit-def: $vgpr41
+; GCN-NEXT:    ; implicit-def: $vgpr59
+; GCN-NEXT:    ; implicit-def: $vgpr30
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr26
-; GCN-NEXT:    ; implicit-def: $vgpr59
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; kill: killed $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr44
-; GCN-NEXT:    ; implicit-def: $vgpr45
+; GCN-NEXT:    ; implicit-def: $vgpr60
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
+; GCN-NEXT:    ; implicit-def: $vgpr46
+; GCN-NEXT:    ; kill: killed $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr46
 ; GCN-NEXT:    ; implicit-def: $vgpr47
+; GCN-NEXT:    ; implicit-def: $vgpr57
+; GCN-NEXT:    ; implicit-def: $vgpr58
 ; GCN-NEXT:  .LBB53_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB53_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v59
+; GCN-NEXT:    v_add_i32_e32 v1, vcc, 3, v60
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xff, v1
-; GCN-NEXT:    v_or_b32_e32 v1, v47, v1
-; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_or_b32_e32 v1, v58, v1
+; GCN-NEXT:    s_waitcnt vmcnt(1) expcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v3, vcc, 3, v26
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xff, v3
-; GCN-NEXT:    v_or_b32_e32 v3, v46, v3
+; GCN-NEXT:    v_or_b32_e32 v3, v57, v3
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v5, vcc, 3, v28
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xff, v5
-; GCN-NEXT:    v_or_b32_e32 v5, v45, v5
-; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v41
+; GCN-NEXT:    v_or_b32_e32 v5, v47, v5
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, 3, v30
 ; GCN-NEXT:    v_and_b32_e32 v7, 0xff, v7
-; GCN-NEXT:    v_or_b32_e32 v7, v44, v7
+; GCN-NEXT:    v_or_b32_e32 v7, v46, v7
 ; GCN-NEXT:    s_movk_i32 s6, 0x300
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v24
-; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v61
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v62
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v63
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v40
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v42
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v30
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v56
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v57
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v58
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v60
-; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v9, vcc, 3, v59
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 3, v24
+; GCN-NEXT:    v_add_i32_e32 v13, vcc, 3, v61
+; GCN-NEXT:    v_add_i32_e32 v15, vcc, 3, v62
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 3, v63
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 3, v40
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 3, v41
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 3, v42
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 3, v43
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 3, v44
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 3, v45
+; GCN-NEXT:    v_add_i32_e32 v27, vcc, 3, v56
 ; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:312 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
@@ -41178,43 +41213,43 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GCN-NEXT:    v_and_b32_e32 v2, 0xff, v2
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xff, v0
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v9, v32, v9
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v11, v32, v11
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v13, v32, v13
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v15, v32, v15
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:292 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v17, v32, v17
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:288 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v19, v32, v19
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:284 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v21, v32, v21
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v23, v32, v23
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v24, v32, v24
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v25, v32, v25
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v26, v32, v26
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v27, v32, v27
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v28, v32, v28
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
@@ -41229,37 +41264,37 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v22, v32, v22
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:264 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v20, v32, v20
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v18, v32, v18
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v16, v32, v16
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v14, v32, v14
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v12, v32, v12
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v10, v32, v10
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v8, v32, v8
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:308 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:280 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v6, v32, v6
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:304 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:276 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v4, v32, v4
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:300 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:272 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v2, v32, v2
-; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:296 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:268 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_or_b32_e32 v0, v32, v0
 ; GCN-NEXT:    v_add_i32_e32 v40, vcc, 0x300, v1
@@ -41400,78 +41435,78 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:88
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:96
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
 ; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:4
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:116
-; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:124
-; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v7
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v9
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v11
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v13
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v15
+; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v29
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v27, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:116
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v1
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
+; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
+; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v13
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v15
 ; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v17
 ; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
 ; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v16
-; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v18
-; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v20
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v10
+; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v12
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v14
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v16
+; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v18
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v20
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v22
+; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v22
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v26
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v24
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v28
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v26
+; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:124
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v30
+; VI-NEXT:    v_lshlrev_b16_e32 v36, 8, v30
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v34, 8, v31
+; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v31
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v34, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:84
 ; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:68
 ; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:52
+; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v28
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v32
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
@@ -41483,79 +41518,80 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; VI-NEXT:    v_or_b32_sdwa v9, v39, v47 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v20, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v28, v59 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v22, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v30, v16 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v35, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v31, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v38, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr39
-; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr28
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v29, v47 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v41, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v46, v59 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v22, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v30, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v34, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v31, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v16, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr22
 ; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr34
 ; VI-NEXT:    ; implicit-def: $vgpr31
-; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr59
-; VI-NEXT:    ; implicit-def: $vgpr62
-; VI-NEXT:    ; implicit-def: $vgpr18
-; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr36
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_or_b32_sdwa v0, v0, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v0, v0, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v51 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v53 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    ; implicit-def: $vgpr49
 ; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr23
-; VI-NEXT:    ; implicit-def: $vgpr27
+; VI-NEXT:    ; implicit-def: $vgpr40
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v45 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v52 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v4, v4, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr52
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr40
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -41572,23 +41608,23 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr25
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v42 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v49, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v27, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v48, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v55, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v55, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v36, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v18, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v43, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v26, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v37, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v37, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v44, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v35, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v61, v34 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -41622,98 +41658,92 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr36
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr37
 ; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr29
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr18
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr37
+; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr42
+; VI-NEXT:    ; implicit-def: $vgpr45
 ; VI-NEXT:    ; implicit-def: $vgpr56
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr34
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr20
+; VI-NEXT:    ; implicit-def: $vgpr28
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:  .LBB53_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB53_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v38
-; VI-NEXT:    v_add_u16_e32 v2, 3, v44
+; VI-NEXT:    v_add_u16_e32 v0, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v0, v32, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v14, v26, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_add_u16_e32 v2, 3, v35
 ; VI-NEXT:    v_mov_b32_e32 v3, 0x300
-; VI-NEXT:    v_or_b32_sdwa v2, v18, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v18, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v37
-; VI-NEXT:    v_or_b32_sdwa v24, v24, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v16, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_add_u16_e32 v0, 3, v37
+; VI-NEXT:    v_or_b32_sdwa v20, v20, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v0, v16, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v38, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v43
-; VI-NEXT:    v_or_b32_sdwa v16, v63, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v22
-; VI-NEXT:    v_or_b32_sdwa v0, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v11, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v36
-; VI-NEXT:    v_or_b32_sdwa v22, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v26, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v28
+; VI-NEXT:    v_add_u16_e32 v0, 3, v22
+; VI-NEXT:    v_or_b32_sdwa v0, v61, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v11, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_add_u16_e32 v0, 3, v18
+; VI-NEXT:    v_or_b32_sdwa v18, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v0, v59, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v55
-; VI-NEXT:    v_or_b32_sdwa v28, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v20
+; VI-NEXT:    v_add_u16_e32 v0, 3, v44
+; VI-NEXT:    v_or_b32_sdwa v22, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v0, v57, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v2, 3, v35
 ; VI-NEXT:    v_add_u16_sdwa v9, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v20, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v0, 3, v39
+; VI-NEXT:    v_add_u16_e32 v0, 3, v55
+; VI-NEXT:    v_or_b32_sdwa v14, v28, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v28, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v0, v47, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v30, v46, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v27
+; VI-NEXT:    v_or_b32_sdwa v27, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v2, 3, v34
+; VI-NEXT:    v_or_b32_sdwa v2, v24, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v1, 3, v61
-; VI-NEXT:    v_or_b32_sdwa v15, v34, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v63
+; VI-NEXT:    v_or_b32_sdwa v15, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v1, 3, v31
-; VI-NEXT:    v_or_b32_sdwa v1, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v26, v1, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
-; VI-NEXT:    v_or_b32_e32 v12, v16, v12
-; VI-NEXT:    v_add_u16_e32 v16, 0x300, v24
+; VI-NEXT:    v_or_b32_sdwa v1, v36, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v24, v1, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
-; VI-NEXT:    v_or_b32_e32 v13, v16, v13
-; VI-NEXT:    v_or_b32_e32 v14, v14, v26
-; VI-NEXT:    v_or_b32_e32 v15, v15, v18
+; VI-NEXT:    v_or_b32_e32 v14, v14, v24
+; VI-NEXT:    v_or_b32_e32 v15, v15, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v29, v29, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v29, v42, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v27, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -41746,54 +41776,58 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v6, v17, v6
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v29
 ; VI-NEXT:    v_or_b32_e32 v7, v17, v7
-; VI-NEXT:    v_add_u16_e32 v17, 0x300, v30
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v27
 ; VI-NEXT:    v_or_b32_e32 v8, v17, v8
-; VI-NEXT:    v_add_u16_e32 v17, 0x300, v20
-; VI-NEXT:    v_or_b32_e32 v9, v17, v9
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v28
-; VI-NEXT:    v_or_b32_e32 v10, v17, v10
+; VI-NEXT:    v_or_b32_e32 v9, v17, v9
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v22
+; VI-NEXT:    v_or_b32_e32 v10, v17, v10
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v18
 ; VI-NEXT:    v_or_b32_e32 v11, v17, v11
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v26
+; VI-NEXT:    v_or_b32_e32 v12, v17, v12
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v20
+; VI-NEXT:    v_or_b32_e32 v13, v17, v13
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v42, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v23, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v23, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v2, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v27, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v30, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v1, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v31, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v31, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v48, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v0, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v3, v50, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v39, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v3, v0
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v31
 ; VI-NEXT:    v_or_b32_e32 v1, v3, v1
-; VI-NEXT:    v_add_u16_e32 v3, 0x300, v27
+; VI-NEXT:    v_add_u16_e32 v3, 0x300, v30
 ; VI-NEXT:    v_or_b32_e32 v2, v3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v23
 ; VI-NEXT:    v_or_b32_e32 v3, v3, v19
@@ -41853,99 +41887,100 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:80
 ; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:88
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:96
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v33, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v7
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v9
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v11
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v13
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v15
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v17
+; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v17
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v19
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v23
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v25
+; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
-; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v29
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v27, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v25, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v1
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v9
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v11
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v13
+; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v15
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
+; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v4
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v10
+; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v12
+; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v16
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v18
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v20
+; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v24
-; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v28
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v31
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v32
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v34, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v33
+; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v31
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v32
+; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB53_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
@@ -41953,108 +41988,107 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v39, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v35, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v34, v36 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v16, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v32, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v38, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr39
-; GFX9-NEXT:    ; implicit-def: $vgpr62
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr35
-; GFX9-NEXT:    ; implicit-def: $vgpr34
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr38
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v25, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v27, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(10)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v26, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v37, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v20, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v31, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr25
+; GFX9-NEXT:    ; implicit-def: $vgpr27
+; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr37
+; GFX9-NEXT:    ; implicit-def: $vgpr20
+; GFX9-NEXT:    ; implicit-def: $vgpr31
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr36
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr18
+; GFX9-NEXT:    ; implicit-def: $vgpr30
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s6
 ; GFX9-NEXT:    v_perm_b32 v1, v3, v2, s6
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr49
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr39
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr53
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr23
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr27
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v2, v3, v2, s6
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    ; implicit-def: $vgpr52
+; GFX9-NEXT:    ; implicit-def: $vgpr51
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v3, v4, v3, s6
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr43
+; GFX9-NEXT:    ; implicit-def: $vgpr55
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v5, v6, v5, s6
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v6, v7, v6, s6
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr25
+; GFX9-NEXT:    ; implicit-def: $vgpr40
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v7, v8, v7, s6
-; GFX9-NEXT:    v_or_b32_sdwa v8, v51, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v29, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
-; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v57 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v42, v57 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
-; GFX9-NEXT:    v_or_b32_sdwa v10, v42, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v47, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    v_or_b32_sdwa v11, v37, v61 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v35, v61 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    v_or_b32_sdwa v12, v44, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v16, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
-; GFX9-NEXT:    v_or_b32_sdwa v13, v28, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v24, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    v_or_b32_sdwa v14, v47, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v34 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    v_or_b32_sdwa v15, v18, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v38, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v36, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -42088,111 +42122,110 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr52
+; GFX9-NEXT:    ; implicit-def: $vgpr29
 ; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr37
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr28
-; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr18
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; implicit-def: $vgpr38
+; GFX9-NEXT:    ; implicit-def: $vgpr36
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr22
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr31
+; GFX9-NEXT:    ; implicit-def: $vgpr34
+; GFX9-NEXT:    ; implicit-def: $vgpr28
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:  .LBB53_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB53_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
-; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v18
-; GFX9-NEXT:    v_or_b32_sdwa v0, v31, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v0, v28, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v28
-; GFX9-NEXT:    v_or_b32_sdwa v0, v30, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v24
+; GFX9-NEXT:    v_or_b32_sdwa v0, v22, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v20
+; GFX9-NEXT:    v_or_b32_sdwa v0, v18, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v0
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v16
-; GFX9-NEXT:    v_or_b32_sdwa v0, v26, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v0, v22, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v63, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v34
-; GFX9-NEXT:    v_or_b32_sdwa v0, v36, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v37
+; GFX9-NEXT:    v_or_b32_sdwa v0, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v35
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v61, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v35
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v26
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v24, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v42
+; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v47
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v59, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v42
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v57, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v51
+; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v46, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_add_u16_e32 v3, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v3, v20, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v3
+; GFX9-NEXT:    v_add_u16_e32 v3, 3, v31
+; GFX9-NEXT:    v_or_b32_sdwa v3, v30, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v3
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v32
+; GFX9-NEXT:    v_or_b32_sdwa v2, v34, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v2
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v38
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v36
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v1
+; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v1
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
-; GFX9-NEXT:    v_perm_b32 v8, v30, v8, s6
-; GFX9-NEXT:    v_perm_b32 v9, v28, v9, s6
-; GFX9-NEXT:    v_perm_b32 v10, v26, v10, s6
-; GFX9-NEXT:    v_perm_b32 v11, v24, v11, s6
-; GFX9-NEXT:    v_perm_b32 v12, v22, v12, s6
-; GFX9-NEXT:    v_perm_b32 v13, v16, v13, s6
-; GFX9-NEXT:    v_perm_b32 v14, v20, v14, s6
-; GFX9-NEXT:    v_perm_b32 v15, v18, v15, s6
+; GFX9-NEXT:    v_perm_b32 v8, v25, v8, s6
+; GFX9-NEXT:    v_perm_b32 v9, v24, v9, s6
+; GFX9-NEXT:    v_perm_b32 v10, v22, v10, s6
+; GFX9-NEXT:    v_perm_b32 v11, v20, v11, s6
+; GFX9-NEXT:    v_perm_b32 v12, v16, v12, s6
+; GFX9-NEXT:    v_perm_b32 v13, v18, v13, s6
+; GFX9-NEXT:    v_perm_b32 v14, v30, v14, s6
+; GFX9-NEXT:    v_perm_b32 v15, v28, v15, s6
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v29, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v44, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
-; GFX9-NEXT:    v_or_b32_sdwa v31, v48, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v31, v39, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v27, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v27, 0x300, v0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_perm_b32 v7, v27, v7, s6
+; GFX9-NEXT:    v_perm_b32 v7, v26, v7, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v25, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v6, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -42214,7 +42247,7 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v5, v19, v5, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v55, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -42225,40 +42258,40 @@ define <32 x half> @bitcast_v64i8_to_v32f16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v4, v17, v4, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v3, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v3, v21, v3, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v55, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v27, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_perm_b32 v2, v25, v2, s6
+; GFX9-NEXT:    v_perm_b32 v2, v27, v2, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v1, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v29, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v1, v29, v1, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v48, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_perm_b32 v0, v31, v0, s6
 ; GFX9-NEXT:  .LBB53_4: ; %end
@@ -45359,27 +45392,28 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v20, v9, 16, 1
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v23, 0x400000, v9
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v22, 0xffff0000, v14
-; GFX11-TRUE16-NEXT:    v_dual_cndmask_b32 v65, v19, v25 :: v_dual_lshlrev_b32 v14, 16, v14
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v65, v19, v25, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v20, v20, v9, 0x7fff
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v19, v21, 16, 1
-; GFX11-TRUE16-NEXT:    v_dual_add_f32 v14, 0x40c00000, v14 :: v_dual_lshlrev_b32 v11, 16, v11
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v53, v24, v50, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
+; GFX11-TRUE16-NEXT:    v_add_f32_e32 v14, 0x40c00000, v14
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v71, 24, v10
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v80, 8, v10
-; GFX11-TRUE16-NEXT:    v_bfe_u32 v25, v14, 16, 1
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v12, 0xffff, v7, v53
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v9, v20, v23, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v7, 0x40c00000, v11
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v11, v19, v21, 0x7fff
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v19, 0x400000, v21
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v20.l, v65.h
+; GFX11-TRUE16-NEXT:    v_bfe_u32 v25, v14, 16, 1
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v23, v7, 16, 1
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v20.l, v65.h
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v64, 24, v12
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 8, v12
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v11, v11, v19, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v19, 0x40c00000, v22
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v21, v23, v7, 0x7fff
@@ -45407,7 +45441,7 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v68, v23, v24, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v23, v13, 16, 1
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v83, 8, v9
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v66, 8, v12
 ; GFX11-TRUE16-NEXT:    v_add_f32_e32 v16, 0x40c00000, v16
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v22.l, v68.h
 ; GFX11-TRUE16-NEXT:    v_dual_cndmask_b32 v7, v14, v19 :: v_dual_add_f32 v14, 0x40c00000, v21
@@ -45449,7 +45483,7 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v16, 0xffff, v19, v82
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v54, 8, v14
 ; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v69, 8, v11
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v83, 8, v9
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v15, 0xffff, v15, v13
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v13, 0xffff, v21, v7
 ; GFX11-TRUE16-NEXT:    v_bfi_b32 v7, 0xffff, v18, v17
@@ -46254,7 +46288,7 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:328 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:324 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:88
 ; GCN-NEXT:    s_waitcnt expcnt(2)
 ; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:80
 ; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:76
@@ -46333,17 +46367,18 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v2, 8, v25
 ; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:272 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:88
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 24, v24
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v24
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v10
+; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 24, v10
-; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:96
 ; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:84
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 8, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 8, v17
 ; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:92
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v17
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GCN-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:104
 ; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:100
@@ -47031,78 +47066,78 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:248 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; VI-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Spill
-; VI-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:132
-; VI-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
+; VI-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; VI-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:88
 ; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:96
 ; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:104
 ; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:112
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
 ; VI-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_ushort v48, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_ushort v43, off, s[0:3], s32 offset:68
-; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:36
-; VI-NEXT:    buffer_load_ushort v49, off, s[0:3], s32 offset:4
-; VI-NEXT:    buffer_load_ushort v61, off, s[0:3], s32 offset:116
-; VI-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:124
-; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v1
-; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v3
-; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v5
-; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v7
-; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v9
-; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v11
-; VI-NEXT:    v_lshlrev_b16_e32 v41, 8, v13
-; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v15
+; VI-NEXT:    v_lshlrev_b16_e32 v40, 8, v27
+; VI-NEXT:    v_lshlrev_b16_e32 v42, 8, v29
+; VI-NEXT:    buffer_load_ushort v46, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v55, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ushort v27, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:116
+; VI-NEXT:    v_lshlrev_b16_e32 v39, 8, v1
+; VI-NEXT:    v_lshlrev_b16_e32 v48, 8, v3
+; VI-NEXT:    v_lshlrev_b16_e32 v49, 8, v5
+; VI-NEXT:    v_lshlrev_b16_e32 v50, 8, v7
+; VI-NEXT:    v_lshlrev_b16_e32 v51, 8, v9
+; VI-NEXT:    v_lshlrev_b16_e32 v52, 8, v11
+; VI-NEXT:    v_lshlrev_b16_e32 v53, 8, v13
+; VI-NEXT:    v_lshlrev_b16_e32 v54, 8, v15
 ; VI-NEXT:    v_lshlrev_b16_e32 v17, 8, v17
 ; VI-NEXT:    v_lshlrev_b16_e32 v19, 8, v19
 ; VI-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
 ; VI-NEXT:    v_lshlrev_b16_e32 v23, 8, v23
 ; VI-NEXT:    v_lshlrev_b16_e32 v25, 8, v25
-; VI-NEXT:    v_lshlrev_b16_e32 v27, 8, v27
-; VI-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
 ; VI-NEXT:    s_waitcnt vmcnt(14)
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v24
-; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v0
-; VI-NEXT:    v_lshlrev_b16_e32 v46, 8, v2
-; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v4
-; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v6
-; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v8
-; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v10
-; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v12
-; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v14
-; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v16
-; VI-NEXT:    v_lshlrev_b16_e32 v63, 8, v18
-; VI-NEXT:    v_lshlrev_b16_e32 v16, 8, v20
+; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; VI-NEXT:    v_lshlrev_b16_e32 v43, 8, v2
+; VI-NEXT:    v_lshlrev_b16_e32 v45, 8, v4
+; VI-NEXT:    v_lshlrev_b16_e32 v47, 8, v6
+; VI-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
+; VI-NEXT:    v_lshlrev_b16_e32 v57, 8, v10
+; VI-NEXT:    v_lshlrev_b16_e32 v58, 8, v12
+; VI-NEXT:    v_lshlrev_b16_e32 v59, 8, v14
+; VI-NEXT:    v_lshlrev_b16_e32 v60, 8, v16
+; VI-NEXT:    v_lshlrev_b16_e32 v61, 8, v18
+; VI-NEXT:    v_lshlrev_b16_e32 v62, 8, v20
 ; VI-NEXT:    s_waitcnt vmcnt(13)
-; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v22
+; VI-NEXT:    v_lshlrev_b16_e32 v38, 8, v22
 ; VI-NEXT:    s_waitcnt vmcnt(12)
-; VI-NEXT:    v_lshlrev_b16_e32 v18, 8, v26
+; VI-NEXT:    v_lshlrev_b16_e32 v20, 8, v24
 ; VI-NEXT:    s_waitcnt vmcnt(11)
-; VI-NEXT:    v_lshlrev_b16_e32 v26, 8, v28
+; VI-NEXT:    v_lshlrev_b16_e32 v24, 8, v26
+; VI-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:124
 ; VI-NEXT:    s_waitcnt vmcnt(10)
-; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v30
+; VI-NEXT:    v_lshlrev_b16_e32 v36, 8, v30
 ; VI-NEXT:    s_waitcnt vmcnt(9)
-; VI-NEXT:    v_lshlrev_b16_e32 v34, 8, v31
+; VI-NEXT:    v_lshlrev_b16_e32 v33, 8, v31
 ; VI-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_ushort v34, off, s[0:3], s32 offset:92
 ; VI-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:84
 ; VI-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:68
 ; VI-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:52
+; VI-NEXT:    v_lshlrev_b16_e32 v28, 8, v28
 ; VI-NEXT:    s_waitcnt vmcnt(14)
 ; VI-NEXT:    v_lshlrev_b16_e32 v32, 8, v32
 ; VI-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
@@ -47114,79 +47149,80 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
-; VI-NEXT:    v_or_b32_sdwa v9, v39, v47 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v10, v20, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v28, v59 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v22, v62 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v30, v16 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v35, v18 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v31, v33 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v16, v38, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr39
-; VI-NEXT:    ; implicit-def: $vgpr20
-; VI-NEXT:    ; implicit-def: $vgpr28
+; VI-NEXT:    s_waitcnt vmcnt(14)
+; VI-NEXT:    v_or_b32_sdwa v9, v29, v47 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v41, v57 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v46, v59 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_or_b32_sdwa v12, v22, v61 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v30, v38 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v34, v24 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v31, v36 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v16, v16, v32 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr29
+; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr46
 ; VI-NEXT:    ; implicit-def: $vgpr22
 ; VI-NEXT:    ; implicit-def: $vgpr30
-; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr34
 ; VI-NEXT:    ; implicit-def: $vgpr31
-; VI-NEXT:    ; implicit-def: $vgpr38
 ; VI-NEXT:    ; implicit-def: $vgpr47
 ; VI-NEXT:    ; implicit-def: $vgpr57
 ; VI-NEXT:    ; implicit-def: $vgpr59
-; VI-NEXT:    ; implicit-def: $vgpr62
-; VI-NEXT:    ; implicit-def: $vgpr18
-; VI-NEXT:    ; implicit-def: $vgpr33
+; VI-NEXT:    ; implicit-def: $vgpr61
+; VI-NEXT:    ; implicit-def: $vgpr38
+; VI-NEXT:    ; implicit-def: $vgpr24
+; VI-NEXT:    ; implicit-def: $vgpr36
 ; VI-NEXT:    ; implicit-def: $vgpr32
 ; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_or_b32_sdwa v0, v0, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v0, v0, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_sdwa v1, v1, v51 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v1, v1, v48 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v53 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v50 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; VI-NEXT:    ; implicit-def: $vgpr39
+; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    ; implicit-def: $vgpr49
 ; VI-NEXT:    ; implicit-def: $vgpr50
-; VI-NEXT:    ; implicit-def: $vgpr51
-; VI-NEXT:    ; implicit-def: $vgpr52
-; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v19 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(5)
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v23 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v27 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr19
 ; VI-NEXT:    ; implicit-def: $vgpr23
-; VI-NEXT:    ; implicit-def: $vgpr27
+; VI-NEXT:    ; implicit-def: $vgpr40
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_or_b32_sdwa v8, v8, v45 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; VI-NEXT:    ; implicit-def: $vgpr45
+; VI-NEXT:    v_or_b32_sdwa v8, v8, v43 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr43
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v2, v2, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v2, v2, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v40 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v52 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v42 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v4, v4, v54 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    ; implicit-def: $vgpr51
+; VI-NEXT:    ; implicit-def: $vgpr52
 ; VI-NEXT:    ; implicit-def: $vgpr54
-; VI-NEXT:    ; implicit-def: $vgpr40
-; VI-NEXT:    ; implicit-def: $vgpr42
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v3, v3, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v3, v3, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
-; VI-NEXT:    ; implicit-def: $vgpr41
+; VI-NEXT:    ; implicit-def: $vgpr53
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
@@ -47203,23 +47239,23 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; VI-NEXT:    ; implicit-def: $vgpr25
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_or_b32_sdwa v7, v7, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v7, v7, v42 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v8, v49, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v8, v27, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v9, v48, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v9, v55, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v10, v55, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v10, v44, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v10, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v11, v36, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v11, v18, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v11, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v12, v43, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v12, v26, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v12, v12, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v13, v37, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v13, v37, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v13, v13, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v14, v44, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v14, v35, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v14, v14, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_or_b32_sdwa v15, v61, v34 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v15, v63, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v15, v15, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
@@ -47253,98 +47289,92 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    ; kill: killed $vgpr16
 ; VI-NEXT:    ; implicit-def: $vgpr16
 ; VI-NEXT:    ; kill: killed $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr49
-; VI-NEXT:    ; implicit-def: $vgpr48
+; VI-NEXT:    ; implicit-def: $vgpr27
 ; VI-NEXT:    ; implicit-def: $vgpr55
-; VI-NEXT:    ; implicit-def: $vgpr36
-; VI-NEXT:    ; implicit-def: $vgpr43
-; VI-NEXT:    ; implicit-def: $vgpr37
 ; VI-NEXT:    ; implicit-def: $vgpr44
-; VI-NEXT:    ; implicit-def: $vgpr61
-; VI-NEXT:    ; implicit-def: $vgpr29
-; VI-NEXT:    ; implicit-def: $vgpr46
+; VI-NEXT:    ; implicit-def: $vgpr18
+; VI-NEXT:    ; implicit-def: $vgpr26
+; VI-NEXT:    ; implicit-def: $vgpr37
+; VI-NEXT:    ; implicit-def: $vgpr35
+; VI-NEXT:    ; implicit-def: $vgpr63
+; VI-NEXT:    ; implicit-def: $vgpr16
+; VI-NEXT:    ; implicit-def: $vgpr42
+; VI-NEXT:    ; implicit-def: $vgpr45
 ; VI-NEXT:    ; implicit-def: $vgpr56
 ; VI-NEXT:    ; implicit-def: $vgpr58
 ; VI-NEXT:    ; implicit-def: $vgpr60
-; VI-NEXT:    ; implicit-def: $vgpr63
-; VI-NEXT:    ; implicit-def: $vgpr16
-; VI-NEXT:    ; implicit-def: $vgpr24
-; VI-NEXT:    ; implicit-def: $vgpr26
-; VI-NEXT:    ; implicit-def: $vgpr34
+; VI-NEXT:    ; implicit-def: $vgpr62
+; VI-NEXT:    ; implicit-def: $vgpr20
+; VI-NEXT:    ; implicit-def: $vgpr28
+; VI-NEXT:    ; implicit-def: $vgpr33
 ; VI-NEXT:  .LBB55_2: ; %Flow
 ; VI-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; VI-NEXT:    s_cbranch_execz .LBB55_4
 ; VI-NEXT:  ; %bb.3: ; %cmp.true
 ; VI-NEXT:    s_waitcnt vmcnt(8)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v38
-; VI-NEXT:    v_add_u16_e32 v2, 3, v44
+; VI-NEXT:    v_add_u16_e32 v0, 3, v16
 ; VI-NEXT:    v_or_b32_sdwa v0, v32, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_or_b32_sdwa v14, v26, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_add_u16_e32 v2, 3, v35
 ; VI-NEXT:    v_mov_b32_e32 v3, 0x300
-; VI-NEXT:    v_or_b32_sdwa v2, v18, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v18, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v37
-; VI-NEXT:    v_or_b32_sdwa v24, v24, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v16, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(4)
+; VI-NEXT:    v_add_u16_e32 v0, 3, v37
+; VI-NEXT:    v_or_b32_sdwa v20, v20, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    s_waitcnt vmcnt(3)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v30
-; VI-NEXT:    v_or_b32_sdwa v0, v16, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v38, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v12, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v43
-; VI-NEXT:    v_or_b32_sdwa v16, v63, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v22
-; VI-NEXT:    v_or_b32_sdwa v0, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v11, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v36
-; VI-NEXT:    v_or_b32_sdwa v22, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v26
+; VI-NEXT:    v_or_b32_sdwa v26, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v28
+; VI-NEXT:    v_add_u16_e32 v0, 3, v22
+; VI-NEXT:    v_or_b32_sdwa v0, v61, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v11, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_add_u16_e32 v0, 3, v18
+; VI-NEXT:    v_or_b32_sdwa v18, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v46
 ; VI-NEXT:    v_or_b32_sdwa v0, v59, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v10, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v55
-; VI-NEXT:    v_or_b32_sdwa v28, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_add_u16_e32 v0, 3, v20
+; VI-NEXT:    v_add_u16_e32 v0, 3, v44
+; VI-NEXT:    v_or_b32_sdwa v22, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v41
 ; VI-NEXT:    v_or_b32_sdwa v0, v57, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v2, 3, v35
 ; VI-NEXT:    v_add_u16_sdwa v9, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v48
-; VI-NEXT:    v_or_b32_sdwa v20, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_e32 v0, 3, v39
+; VI-NEXT:    v_add_u16_e32 v0, 3, v55
+; VI-NEXT:    v_or_b32_sdwa v14, v28, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v28, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v29
 ; VI-NEXT:    v_or_b32_sdwa v0, v47, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v8, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v0, 3, v49
-; VI-NEXT:    v_or_b32_sdwa v30, v46, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v0, 3, v27
+; VI-NEXT:    v_or_b32_sdwa v27, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
+; VI-NEXT:    v_add_u16_e32 v2, 3, v34
+; VI-NEXT:    v_or_b32_sdwa v2, v24, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v13, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v1, 3, v61
-; VI-NEXT:    v_or_b32_sdwa v15, v34, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_e32 v1, 3, v63
+; VI-NEXT:    v_or_b32_sdwa v15, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v1, 3, v31
-; VI-NEXT:    v_or_b32_sdwa v1, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; VI-NEXT:    v_add_u16_sdwa v26, v1, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_add_u16_e32 v16, 0x300, v16
-; VI-NEXT:    v_or_b32_e32 v12, v16, v12
-; VI-NEXT:    v_add_u16_e32 v16, 0x300, v24
+; VI-NEXT:    v_or_b32_sdwa v1, v36, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_add_u16_sdwa v24, v1, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    v_add_u16_e32 v14, 0x300, v14
 ; VI-NEXT:    v_add_u16_e32 v15, 0x300, v15
-; VI-NEXT:    v_or_b32_e32 v13, v16, v13
-; VI-NEXT:    v_or_b32_e32 v14, v14, v26
-; VI-NEXT:    v_or_b32_e32 v15, v15, v18
+; VI-NEXT:    v_or_b32_e32 v14, v14, v24
+; VI-NEXT:    v_or_b32_e32 v15, v15, v16
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v7, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v29, v29, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v29, v42, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v27, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v6, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -47377,54 +47407,58 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; VI-NEXT:    v_or_b32_e32 v6, v17, v6
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v29
 ; VI-NEXT:    v_or_b32_e32 v7, v17, v7
-; VI-NEXT:    v_add_u16_e32 v17, 0x300, v30
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v27
 ; VI-NEXT:    v_or_b32_e32 v8, v17, v8
-; VI-NEXT:    v_add_u16_e32 v17, 0x300, v20
-; VI-NEXT:    v_or_b32_e32 v9, v17, v9
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v28
-; VI-NEXT:    v_or_b32_e32 v10, v17, v10
+; VI-NEXT:    v_or_b32_e32 v9, v17, v9
 ; VI-NEXT:    v_add_u16_e32 v17, 0x300, v22
+; VI-NEXT:    v_or_b32_e32 v10, v17, v10
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v18
 ; VI-NEXT:    v_or_b32_e32 v11, v17, v11
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v26
+; VI-NEXT:    v_or_b32_e32 v12, v17, v12
+; VI-NEXT:    v_add_u16_e32 v17, 0x300, v20
+; VI-NEXT:    v_or_b32_e32 v13, v17, v13
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v42, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v19, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v23, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v23, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v2, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v27, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v30, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v1, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v31, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v31, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v0, 3, v0
-; VI-NEXT:    v_or_b32_sdwa v0, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v0, v48, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_sdwa v0, v0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_add_u16_e32 v3, 3, v3
-; VI-NEXT:    v_or_b32_sdwa v3, v50, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; VI-NEXT:    v_or_b32_sdwa v3, v39, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v3, v0
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v31
 ; VI-NEXT:    v_or_b32_e32 v1, v3, v1
-; VI-NEXT:    v_add_u16_e32 v3, 0x300, v27
+; VI-NEXT:    v_add_u16_e32 v3, 0x300, v30
 ; VI-NEXT:    v_or_b32_e32 v2, v3, v2
 ; VI-NEXT:    v_add_u16_e32 v3, 0x300, v23
 ; VI-NEXT:    v_or_b32_e32 v3, v3, v19
@@ -47484,99 +47518,100 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:256 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:224 ; 4-byte Folded Spill
 ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Spill
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:132
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:132
 ; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
-; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_ushort v2, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_ushort v4, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ushort v6, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ushort v8, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ushort v10, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ushort v12, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_ushort v14, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_ushort v22, off, s[0:3], s32 offset:80
 ; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:88
 ; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:96
 ; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:104
-; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_ushort v33, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_ushort v52, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ushort v39, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_ushort v44, off, s[0:3], s32 offset:68
-; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:36
-; GFX9-NEXT:    buffer_load_ushort v51, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v1
-; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v3
-; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v5
-; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v7
-; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v9
-; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v11
-; GFX9-NEXT:    v_lshlrev_b16_e32 v41, 8, v13
-; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v15
-; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v17
+; GFX9-NEXT:    buffer_load_ushort v30, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:120
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    v_lshlrev_b16_e32 v55, 8, v17
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v17, 8, v19
-; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v19, 8, v23
-; GFX9-NEXT:    v_lshlrev_b16_e32 v25, 8, v25
+; GFX9-NEXT:    v_lshlrev_b16_e32 v40, 8, v25
 ; GFX9-NEXT:    v_lshlrev_b16_e32 v23, 8, v27
-; GFX9-NEXT:    v_lshlrev_b16_e32 v29, 8, v29
-; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    v_lshlrev_b16_e32 v44, 8, v29
+; GFX9-NEXT:    buffer_load_ushort v41, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_ushort v47, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_ushort v27, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ushort v42, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ushort v25, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ushort v29, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ushort v38, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    buffer_load_ushort v36, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    v_lshlrev_b16_e32 v48, 8, v1
+; GFX9-NEXT:    v_lshlrev_b16_e32 v39, 8, v3
+; GFX9-NEXT:    v_lshlrev_b16_e32 v50, 8, v5
+; GFX9-NEXT:    v_lshlrev_b16_e32 v49, 8, v7
+; GFX9-NEXT:    v_lshlrev_b16_e32 v52, 8, v9
+; GFX9-NEXT:    v_lshlrev_b16_e32 v51, 8, v11
+; GFX9-NEXT:    v_lshlrev_b16_e32 v54, 8, v13
+; GFX9-NEXT:    v_lshlrev_b16_e32 v53, 8, v15
+; GFX9-NEXT:    v_lshlrev_b16_e32 v21, 8, v21
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
 ; GFX9-NEXT:    s_waitcnt vmcnt(24)
-; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v22
+; GFX9-NEXT:    v_lshlrev_b16_e32 v43, 8, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v27, 8, v0
+; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v4
 ; GFX9-NEXT:    s_waitcnt vmcnt(22)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v46, 8, v2
+; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v6
 ; GFX9-NEXT:    s_waitcnt vmcnt(21)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v45, 8, v4
+; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v8
 ; GFX9-NEXT:    s_waitcnt vmcnt(20)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v57, 8, v6
+; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v10
 ; GFX9-NEXT:    s_waitcnt vmcnt(19)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v56, 8, v8
+; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(18)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v59, 8, v10
+; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v14
 ; GFX9-NEXT:    s_waitcnt vmcnt(17)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v58, 8, v12
+; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v61, 8, v14
+; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v18
 ; GFX9-NEXT:    s_waitcnt vmcnt(15)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v60, 8, v16
+; GFX9-NEXT:    v_lshlrev_b16_e32 v63, 8, v20
 ; GFX9-NEXT:    s_waitcnt vmcnt(14)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v18
+; GFX9-NEXT:    v_lshlrev_b16_e32 v62, 8, v22
 ; GFX9-NEXT:    s_waitcnt vmcnt(13)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v36, 8, v20
+; GFX9-NEXT:    v_lshlrev_b16_e32 v22, 8, v24
 ; GFX9-NEXT:    s_waitcnt vmcnt(12)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v24
-; GFX9-NEXT:    buffer_load_ushort v18, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    v_lshlrev_b16_e32 v18, 8, v26
 ; GFX9-NEXT:    s_waitcnt vmcnt(11)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v24, 8, v28
+; GFX9-NEXT:    v_lshlrev_b16_e32 v34, 8, v28
 ; GFX9-NEXT:    s_waitcnt vmcnt(10)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v20, 8, v31
+; GFX9-NEXT:    v_lshlrev_b16_e32 v30, 8, v30
 ; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v31, 8, v32
-; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_ushort v28, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_ushort v34, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_ushort v63, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_ushort v62, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    v_lshlrev_b16_e32 v26, 8, v26
-; GFX9-NEXT:    s_waitcnt vmcnt(16)
-; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v33
+; GFX9-NEXT:    v_lshlrev_b16_e32 v28, 8, v31
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    v_lshlrev_b16_e32 v33, 8, v32
+; GFX9-NEXT:    buffer_load_ushort v31, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_ushort v32, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_ushort v20, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_ushort v24, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_ushort v37, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_ushort v16, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_ushort v26, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ushort v35, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
 ; GFX9-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX9-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB55_2
 ; GFX9-NEXT:  ; %bb.1: ; %cmp.false
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
@@ -47584,108 +47619,107 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
-; GFX9-NEXT:    s_waitcnt vmcnt(23)
-; GFX9-NEXT:    v_or_b32_sdwa v9, v39, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    s_waitcnt vmcnt(9)
-; GFX9-NEXT:    v_or_b32_sdwa v10, v62, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v11, v63, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v12, v35, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v13, v34, v36 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v14, v16, v26 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v15, v32, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v16, v38, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr39
-; GFX9-NEXT:    ; implicit-def: $vgpr62
-; GFX9-NEXT:    ; implicit-def: $vgpr63
-; GFX9-NEXT:    ; implicit-def: $vgpr35
-; GFX9-NEXT:    ; implicit-def: $vgpr34
-; GFX9-NEXT:    ; implicit-def: $vgpr32
-; GFX9-NEXT:    ; implicit-def: $vgpr38
+; GFX9-NEXT:    s_waitcnt vmcnt(20)
+; GFX9-NEXT:    v_or_b32_sdwa v9, v25, v45 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v27, v56 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v41, v58 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    s_waitcnt vmcnt(10)
+; GFX9-NEXT:    v_or_b32_sdwa v12, v26, v60 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v37, v62 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v20, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v31, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr25
+; GFX9-NEXT:    ; implicit-def: $vgpr27
+; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr26
+; GFX9-NEXT:    ; implicit-def: $vgpr37
+; GFX9-NEXT:    ; implicit-def: $vgpr20
+; GFX9-NEXT:    ; implicit-def: $vgpr31
 ; GFX9-NEXT:    ; implicit-def: $vgpr45
 ; GFX9-NEXT:    ; implicit-def: $vgpr56
 ; GFX9-NEXT:    ; implicit-def: $vgpr58
 ; GFX9-NEXT:    ; implicit-def: $vgpr60
-; GFX9-NEXT:    ; implicit-def: $vgpr36
-; GFX9-NEXT:    ; implicit-def: $vgpr26
-; GFX9-NEXT:    ; implicit-def: $vgpr20
-; GFX9-NEXT:    ; implicit-def: $vgpr33
+; GFX9-NEXT:    ; implicit-def: $vgpr62
+; GFX9-NEXT:    ; implicit-def: $vgpr18
+; GFX9-NEXT:    ; implicit-def: $vgpr30
 ; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v39 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v1, v1, v48 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v50 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v49 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s6
 ; GFX9-NEXT:    v_perm_b32 v1, v3, v2, s6
 ; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v53 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v17 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(4)
 ; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v19 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(3)
 ; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v23 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr49
 ; GFX9-NEXT:    ; implicit-def: $vgpr48
-; GFX9-NEXT:    ; implicit-def: $vgpr53
+; GFX9-NEXT:    ; implicit-def: $vgpr39
 ; GFX9-NEXT:    ; implicit-def: $vgpr50
-; GFX9-NEXT:    ; implicit-def: $vgpr40
+; GFX9-NEXT:    ; implicit-def: $vgpr49
+; GFX9-NEXT:    ; implicit-def: $vgpr53
 ; GFX9-NEXT:    ; implicit-def: $vgpr17
 ; GFX9-NEXT:    ; implicit-def: $vgpr19
 ; GFX9-NEXT:    ; implicit-def: $vgpr23
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v27 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    ; implicit-def: $vgpr27
+; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    ; implicit-def: $vgpr43
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v2, v52 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v51 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v2, v3, v2, s6
 ; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:260 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr55
-; GFX9-NEXT:    ; implicit-def: $vgpr54
+; GFX9-NEXT:    ; implicit-def: $vgpr52
+; GFX9-NEXT:    ; implicit-def: $vgpr51
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v41 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v54 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v3, v4, v3, s6
 ; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:232 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr41
+; GFX9-NEXT:    ; implicit-def: $vgpr54
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v43 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v55 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v4, v5, v4, s6
 ; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:236 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr43
+; GFX9-NEXT:    ; implicit-def: $vgpr55
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_or_b32_sdwa v5, v5, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v5, v6, v5, s6
 ; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
 ; GFX9-NEXT:    ; implicit-def: $vgpr21
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v25 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v6, v6, v40 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v6, v7, v6, s6
 ; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
-; GFX9-NEXT:    ; implicit-def: $vgpr25
+; GFX9-NEXT:    ; implicit-def: $vgpr40
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v29 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v7, v7, v44 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v7, v8, v7, s6
-; GFX9-NEXT:    v_or_b32_sdwa v8, v51, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v8, v29, v46 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v8, v9, v8, s6
-; GFX9-NEXT:    v_or_b32_sdwa v9, v52, v57 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v9, v42, v57 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v9, v10, v9, s6
-; GFX9-NEXT:    v_or_b32_sdwa v10, v42, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v10, v47, v59 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v10, v11, v10, s6
-; GFX9-NEXT:    v_or_b32_sdwa v11, v37, v61 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v11, v35, v61 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v11, v12, v11, s6
-; GFX9-NEXT:    v_or_b32_sdwa v12, v44, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v12, v16, v63 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v12, v13, v12, s6
-; GFX9-NEXT:    v_or_b32_sdwa v13, v28, v30 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v13, v24, v22 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v13, v14, v13, s6
-; GFX9-NEXT:    v_or_b32_sdwa v14, v47, v24 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v14, v32, v34 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v14, v15, v14, s6
-; GFX9-NEXT:    v_or_b32_sdwa v15, v18, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v15, v38, v28 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v16, v36, v33 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_perm_b32 v15, v16, v15, s6
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
@@ -47719,111 +47753,110 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    ; kill: killed $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; kill: killed $vgpr16
-; GFX9-NEXT:    ; implicit-def: $vgpr51
-; GFX9-NEXT:    ; implicit-def: $vgpr52
+; GFX9-NEXT:    ; implicit-def: $vgpr29
 ; GFX9-NEXT:    ; implicit-def: $vgpr42
-; GFX9-NEXT:    ; implicit-def: $vgpr37
-; GFX9-NEXT:    ; implicit-def: $vgpr44
-; GFX9-NEXT:    ; implicit-def: $vgpr28
-; GFX9-NEXT:    ; implicit-def: $vgpr16
 ; GFX9-NEXT:    ; implicit-def: $vgpr47
-; GFX9-NEXT:    ; implicit-def: $vgpr18
-; GFX9-NEXT:    ; implicit-def: $vgpr29
+; GFX9-NEXT:    ; implicit-def: $vgpr35
+; GFX9-NEXT:    ; implicit-def: $vgpr16
+; GFX9-NEXT:    ; implicit-def: $vgpr24
+; GFX9-NEXT:    ; implicit-def: $vgpr32
+; GFX9-NEXT:    ; implicit-def: $vgpr38
+; GFX9-NEXT:    ; implicit-def: $vgpr36
+; GFX9-NEXT:    ; implicit-def: $vgpr44
 ; GFX9-NEXT:    ; implicit-def: $vgpr46
 ; GFX9-NEXT:    ; implicit-def: $vgpr57
 ; GFX9-NEXT:    ; implicit-def: $vgpr59
 ; GFX9-NEXT:    ; implicit-def: $vgpr61
+; GFX9-NEXT:    ; implicit-def: $vgpr63
 ; GFX9-NEXT:    ; implicit-def: $vgpr22
-; GFX9-NEXT:    ; implicit-def: $vgpr30
-; GFX9-NEXT:    ; implicit-def: $vgpr24
-; GFX9-NEXT:    ; implicit-def: $vgpr31
+; GFX9-NEXT:    ; implicit-def: $vgpr34
+; GFX9-NEXT:    ; implicit-def: $vgpr28
+; GFX9-NEXT:    ; implicit-def: $vgpr33
 ; GFX9-NEXT:  .LBB55_2: ; %Flow
 ; GFX9-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GFX9-NEXT:    s_cbranch_execz .LBB55_4
 ; GFX9-NEXT:  ; %bb.3: ; %cmp.true
-; GFX9-NEXT:    s_waitcnt vmcnt(8)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v18
-; GFX9-NEXT:    v_or_b32_sdwa v0, v31, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v38
+; GFX9-NEXT:    v_or_b32_sdwa v0, v28, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v15, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(5)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v28
-; GFX9-NEXT:    v_or_b32_sdwa v0, v30, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v24
+; GFX9-NEXT:    v_or_b32_sdwa v0, v22, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v13, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v20
+; GFX9-NEXT:    v_or_b32_sdwa v0, v18, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v0
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v16
-; GFX9-NEXT:    v_or_b32_sdwa v0, v26, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v44
-; GFX9-NEXT:    v_or_b32_sdwa v0, v22, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v63, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v12, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(4)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v34
-; GFX9-NEXT:    v_or_b32_sdwa v0, v36, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v37
+; GFX9-NEXT:    v_or_b32_sdwa v0, v62, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v16, 0x300, v0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v35
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v61, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v11, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v35
-; GFX9-NEXT:    v_add_u16_e32 v2, 3, v47
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v26
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v60, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_or_b32_sdwa v2, v24, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v42
+; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v47
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v59, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v10, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v63
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v41
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v58, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v52
+; GFX9-NEXT:    v_add_u16_e32 v22, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v42
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v57, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v9, 0x300, v0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v62
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v27
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v56, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v51
+; GFX9-NEXT:    v_add_u16_e32 v24, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v29
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v46, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v8, 0x300, v0
-; GFX9-NEXT:    v_add_u16_e32 v0, 3, v39
+; GFX9-NEXT:    v_add_u16_e32 v0, 3, v25
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v45, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v0
+; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:244 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_add_u16_e32 v3, 3, v32
-; GFX9-NEXT:    v_or_b32_sdwa v3, v20, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v20, 0x300, v3
+; GFX9-NEXT:    v_add_u16_e32 v3, 3, v31
+; GFX9-NEXT:    v_or_b32_sdwa v3, v30, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v30, 0x300, v3
+; GFX9-NEXT:    v_add_u16_e32 v2, 3, v32
+; GFX9-NEXT:    v_or_b32_sdwa v2, v34, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v14, 0x300, v2
-; GFX9-NEXT:    v_add_u16_e32 v1, 3, v38
+; GFX9-NEXT:    v_add_u16_e32 v1, 3, v36
 ; GFX9-NEXT:    v_or_b32_sdwa v1, v33, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v18, 0x300, v1
+; GFX9-NEXT:    v_add_u16_e32 v28, 0x300, v1
 ; GFX9-NEXT:    s_mov_b32 s6, 0x5040100
-; GFX9-NEXT:    v_perm_b32 v8, v30, v8, s6
-; GFX9-NEXT:    v_perm_b32 v9, v28, v9, s6
-; GFX9-NEXT:    v_perm_b32 v10, v26, v10, s6
-; GFX9-NEXT:    v_perm_b32 v11, v24, v11, s6
-; GFX9-NEXT:    v_perm_b32 v12, v22, v12, s6
-; GFX9-NEXT:    v_perm_b32 v13, v16, v13, s6
-; GFX9-NEXT:    v_perm_b32 v14, v20, v14, s6
-; GFX9-NEXT:    v_perm_b32 v15, v18, v15, s6
+; GFX9-NEXT:    v_perm_b32 v8, v25, v8, s6
+; GFX9-NEXT:    v_perm_b32 v9, v24, v9, s6
+; GFX9-NEXT:    v_perm_b32 v10, v22, v10, s6
+; GFX9-NEXT:    v_perm_b32 v11, v20, v11, s6
+; GFX9-NEXT:    v_perm_b32 v12, v16, v12, s6
+; GFX9-NEXT:    v_perm_b32 v13, v18, v13, s6
+; GFX9-NEXT:    v_perm_b32 v14, v30, v14, s6
+; GFX9-NEXT:    v_perm_b32 v15, v28, v15, s6
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:224 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v29, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v44, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v7, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_u16_e32 v31, 3, v31
-; GFX9-NEXT:    v_or_b32_sdwa v31, v48, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v31, v39, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v31, 0x300, v31
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v27, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v27, 0x300, v0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v26, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:240 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_perm_b32 v7, v27, v7, s6
+; GFX9-NEXT:    v_perm_b32 v7, v26, v7, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v25, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v6, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -47845,7 +47878,7 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v5, v19, v5, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v43, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v55, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v4, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -47856,40 +47889,40 @@ define <32 x bfloat> @bitcast_v64i8_to_v32bf16(<64 x i8> %a, i32 %b) {
 ; GFX9-NEXT:    v_perm_b32 v4, v17, v4, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v41, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v3, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v40, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v21, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:248 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v3, v21, v3, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v55, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v52, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v2, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:220 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v54, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX9-NEXT:    v_add_u16_e32 v25, 0x300, v0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v51, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_add_u16_e32 v27, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:256 ; 4-byte Folded Reload
-; GFX9-NEXT:    v_perm_b32 v2, v25, v2, s6
+; GFX9-NEXT:    v_perm_b32 v2, v27, v2, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v53, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v1, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:228 ; 4-byte Folded Reload
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v50, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v29, 0x300, v0
 ; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:252 ; 4-byte Folded Reload
 ; GFX9-NEXT:    v_perm_b32 v1, v29, v1, s6
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_add_u16_e32 v0, 3, v0
-; GFX9-NEXT:    v_or_b32_sdwa v0, v49, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX9-NEXT:    v_or_b32_sdwa v0, v48, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX9-NEXT:    v_add_u16_e32 v0, 0x300, v0
 ; GFX9-NEXT:    v_perm_b32 v0, v31, v0, s6
 ; GFX9-NEXT:  .LBB55_4: ; %end
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll
index 1ef7d358d8cae..b992dbdf27fb7 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll
@@ -12280,14 +12280,14 @@ define <36 x half> @bitcast_v36i16_to_v36f16(<36 x i16> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v61, off, s[0:3], s32 offset:36 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:32 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:28 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v34, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(6)
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32
+; GCN-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    ; implicit-def: $vgpr31
 ; GCN-NEXT:    ; kill: killed $vgpr31
@@ -12386,12 +12386,12 @@ define <36 x half> @bitcast_v36i16_to_v36f16(<36 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v29
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v63, v30
 ; GCN-NEXT:    s_waitcnt vmcnt(9)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v34
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v35
-; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v38
-; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v39
+; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v39
+; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v34
+; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v35
+; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v36
+; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v37
+; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v38
 ; GCN-NEXT:    ; implicit-def: $vgpr1
 ; GCN-NEXT:    ; implicit-def: $vgpr2
 ; GCN-NEXT:    ; implicit-def: $vgpr3
@@ -12422,28 +12422,23 @@ define <36 x half> @bitcast_v36i16_to_v36f16(<36 x i16> %a, i32 %b) {
 ; GCN-NEXT:    ; implicit-def: $vgpr28
 ; GCN-NEXT:    ; implicit-def: $vgpr29
 ; GCN-NEXT:    ; implicit-def: $vgpr30
+; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:    ; implicit-def: $vgpr34
 ; GCN-NEXT:    ; implicit-def: $vgpr35
 ; GCN-NEXT:    ; implicit-def: $vgpr36
 ; GCN-NEXT:    ; implicit-def: $vgpr37
 ; GCN-NEXT:    ; implicit-def: $vgpr38
-; GCN-NEXT:    ; implicit-def: $vgpr39
 ; GCN-NEXT:  .LBB28_2: ; %Flow
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB28_4
 ; GCN-NEXT:  ; %bb.3: ; %cmp.true
-; GCN-NEXT:    s_waitcnt vmcnt(5)
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v39
-; GCN-NEXT:    s_waitcnt vmcnt(4)
-; GCN-NEXT:    v_add_i32_e32 v38, vcc, 3, v38
-; GCN-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v37
-; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    v_add_i32_e32 v36, vcc, 3, v36
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v35
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 3, v38
+; GCN-NEXT:    v_add_i32_e32 v37, vcc, 3, v37
+; GCN-NEXT:    v_add_i32_e32 v32, vcc, 3, v36
+; GCN-NEXT:    v_add_i32_e32 v35, vcc, 3, v35
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 3, v34
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v34
+; GCN-NEXT:    v_add_i32_e32 v34, vcc, 3, v39
 ; GCN-NEXT:    v_add_i32_e32 v30, vcc, 3, v30
 ; GCN-NEXT:    v_add_i32_e32 v29, vcc, 3, v29
 ; GCN-NEXT:    v_add_i32_e32 v28, vcc, 3, v28
@@ -12524,9 +12519,9 @@ define <36 x half> @bitcast_v36i16_to_v36f16(<36 x i16> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v63, v30
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v34
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
-; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v36
+; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v35
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
-; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v37
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
 ; GCN-NEXT:  .LBB28_4: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
@@ -13007,25 +13002,6 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-LABEL: bitcast_v36f16_to_v36i16:
 ; GCN:       ; %bb.0:
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT:    buffer_store_dword v40, off, s[0:3], s32 offset:44 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v41, off, s[0:3], s32 offset:40 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:36 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:32 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:28 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:16
-; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:12
-; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:8
-; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:4
-; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:20
-; GCN-NEXT:    s_waitcnt vmcnt(6)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v31
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v2
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v3
@@ -13046,37 +13022,38 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v19
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v20
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v21
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v21
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v22
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v24
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v26
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v27
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v28
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v29
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v24
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v25
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v26
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v28
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:20
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v30
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v43
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v42
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v41
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v40
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v55
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v3
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v21
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v22
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB29_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.true
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
-; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v21
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
 ; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
@@ -13085,14 +13062,6 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v4
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v21
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
-; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
-; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v6
-; GCN-NEXT:    v_or_b32_e32 v5, v5, v21
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
@@ -13101,6 +13070,22 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v8
 ; GCN-NEXT:    v_or_b32_e32 v7, v7, v21
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
+; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v2
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v21
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
+; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
+; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v6
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v21
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
@@ -13152,12 +13137,12 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v39
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v37
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
+; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
 ; GCN-NEXT:    v_add_f32_e32 v22, 0x38000000, v22
@@ -13170,12 +13155,12 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
 ; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
-; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
-; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
-; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
+; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
+; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
+; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
@@ -13188,12 +13173,12 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
@@ -13201,27 +13186,27 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
 ; GCN-NEXT:    v_or_b32_e32 v52, v22, v21
 ; GCN-NEXT:    v_or_b32_e32 v50, v24, v23
 ; GCN-NEXT:    v_or_b32_e32 v48, v26, v25
 ; GCN-NEXT:    v_or_b32_e32 v38, v28, v27
 ; GCN-NEXT:    v_or_b32_e32 v37, v30, v29
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v20
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v19
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v17
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v15
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v20
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v18
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v19
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v17
 ; GCN-NEXT:    v_alignbit_b32 v54, v35, v21, 16
 ; GCN-NEXT:    v_alignbit_b32 v53, v33, v23, 16
 ; GCN-NEXT:    v_alignbit_b32 v51, v31, v25, 16
 ; GCN-NEXT:    v_alignbit_b32 v49, v11, v27, 16
 ; GCN-NEXT:    v_alignbit_b32 v39, v9, v29, 16
-; GCN-NEXT:    v_alignbit_b32 v20, v7, v20, 16
-; GCN-NEXT:    v_alignbit_b32 v19, v5, v19, 16
+; GCN-NEXT:    v_alignbit_b32 v20, v5, v20, 16
+; GCN-NEXT:    v_alignbit_b32 v18, v1, v18, 16
+; GCN-NEXT:    v_alignbit_b32 v19, v7, v19, 16
 ; GCN-NEXT:    v_alignbit_b32 v17, v3, v17, 16
-; GCN-NEXT:    v_alignbit_b32 v15, v1, v15, 16
 ; GCN-NEXT:  .LBB29_2: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v52
@@ -13263,38 +13248,38 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
 ; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
 ; GCN-NEXT:    v_add_i32_e32 v10, vcc, 36, v0
-; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
+; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v20
+; GCN-NEXT:    v_or_b32_e32 v16, v16, v20
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 40, v0
-; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
-; GCN-NEXT:    v_add_i32_e32 v8, vcc, 44, v0
-; GCN-NEXT:    v_and_b32_e32 v16, 0xffff, v16
-; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
-; GCN-NEXT:    v_or_b32_e32 v16, v16, v19
-; GCN-NEXT:    v_add_i32_e32 v19, vcc, 48, v0
 ; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
 ; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, 52, v0
+; GCN-NEXT:    v_add_i32_e32 v6, vcc, 44, v0
 ; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v18
+; GCN-NEXT:    v_add_i32_e32 v18, vcc, 48, v0
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 52, v0
+; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v19, 16, v19
+; GCN-NEXT:    v_or_b32_e32 v15, v15, v19
+; GCN-NEXT:    v_add_i32_e32 v19, vcc, 56, v0
+; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 60, v0
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
-; GCN-NEXT:    v_or_b32_e32 v14, v14, v17
-; GCN-NEXT:    v_add_i32_e32 v17, vcc, 56, v0
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v17
+; GCN-NEXT:    v_add_i32_e32 v17, vcc, 64, v0
 ; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
 ; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, 60, v0
-; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v15
-; GCN-NEXT:    v_add_i32_e32 v15, vcc, 64, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x44, v0
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, 0x44, v0
 ; GCN-NEXT:    buffer_store_dword v21, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v22, v23, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v24, v25, s[0:3], 0 offen
@@ -13305,19 +13290,14 @@ define <36 x i16> @bitcast_v36f16_to_v36i16(<36 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v11, v12, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v34, v35, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v9, v10, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v18, v20, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v7, v8, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v16, v19, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v16, v20, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v5, v6, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v14, v17, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v3, v4, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v13, v15, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v14, v18, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:28 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:32 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:36 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:40 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:44 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_store_dword v15, v19, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v7, v8, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v13, v17, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v4, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll
index a0fe407022d81..1c70666ccb46d 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll
@@ -16303,30 +16303,19 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-LABEL: bitcast_v44f16_to_v44i16:
 ; GCN:       ; %bb.0:
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT:    buffer_store_dword v40, off, s[0:3], s32 offset:92 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v41, off, s[0:3], s32 offset:88 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:84 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:80 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:72 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v46, off, s[0:3], s32 offset:68 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v47, off, s[0:3], s32 offset:64 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_store_dword v56, off, s[0:3], s32 offset:60 ; 4-byte Folded Spill
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:56
-; GCN-NEXT:    s_waitcnt expcnt(6)
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:20
-; GCN-NEXT:    s_waitcnt expcnt(5)
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:16
-; GCN-NEXT:    s_waitcnt expcnt(4)
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_store_dword v40, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v41, off, s[0:3], s32 offset:72 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v42, off, s[0:3], s32 offset:68 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:64 ; 4-byte Folded Spill
+; GCN-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:60 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(3)
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:8
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:20
 ; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:16
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:12
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v1
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v2
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v3
@@ -16345,105 +16334,110 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v17
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v20
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v21
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v22
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v24
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v26
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v27
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v28
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v29
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v30
-; GCN-NEXT:    s_waitcnt vmcnt(7)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v41
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:52
-; GCN-NEXT:    s_waitcnt vmcnt(7)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v24
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v25
+; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v26
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v30
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:52
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v43
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v42
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v41
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:36
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v1
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v3
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v17
-; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v6
-; GCN-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v2
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v1
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v56
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v28
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB29_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.true
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
 ; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v3
 ; GCN-NEXT:    v_or_b32_e32 v1, v1, v27
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
-; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
-; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v6
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v27
-; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
-; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
-; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v9
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v27
-; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
-; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v27
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
+; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
-; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v7
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v27
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
+; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
-; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
+; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v10
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v27
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v27
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
+; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
 ; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
-; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
+; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
-; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v27
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v27
+; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
+; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v27
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
+; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
+; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v27
+; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
+; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
+; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
+; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
+; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
+; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v12
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v27
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
@@ -16488,16 +16482,16 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v39
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
-; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
-; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
+; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
+; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
+; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_add_f32_e32 v27, 0x38000000, v27
 ; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
 ; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
@@ -16510,16 +16504,16 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
 ; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
-; GCN-NEXT:    v_add_f32_e32 v25, 0x38000000, v25
-; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
-; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
-; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
-; GCN-NEXT:    v_add_f32_e32 v26, 0x38000000, v26
-; GCN-NEXT:    v_add_f32_e32 v22, 0x38000000, v22
 ; GCN-NEXT:    v_add_f32_e32 v24, 0x38000000, v24
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
+; GCN-NEXT:    v_add_f32_e32 v26, 0x38000000, v26
+; GCN-NEXT:    v_add_f32_e32 v22, 0x38000000, v22
+; GCN-NEXT:    v_add_f32_e32 v25, 0x38000000, v25
+; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v27
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
@@ -16532,49 +16526,49 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v26
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v24
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v26
+; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v25
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
 ; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
 ; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v52
 ; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v51
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
 ; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
 ; GCN-NEXT:    v_or_b32_e32 v54, v28, v27
 ; GCN-NEXT:    v_or_b32_e32 v52, v30, v29
 ; GCN-NEXT:    v_or_b32_e32 v50, v50, v53
 ; GCN-NEXT:    v_or_b32_e32 v48, v48, v51
 ; GCN-NEXT:    v_or_b32_e32 v38, v38, v49
 ; GCN-NEXT:    v_or_b32_e32 v37, v37, v39
-; GCN-NEXT:    v_or_b32_e32 v20, v20, v25
-; GCN-NEXT:    v_or_b32_e32 v18, v18, v23
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v26
 ; GCN-NEXT:    v_or_b32_e32 v19, v19, v24
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v21
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v26
+; GCN-NEXT:    v_or_b32_e32 v20, v20, v25
+; GCN-NEXT:    v_or_b32_e32 v18, v18, v23
 ; GCN-NEXT:    v_alignbit_b32 v40, v35, v27, 16
 ; GCN-NEXT:    v_alignbit_b32 v55, v33, v29, 16
 ; GCN-NEXT:    v_alignbit_b32 v53, v31, v53, 16
 ; GCN-NEXT:    v_alignbit_b32 v51, v15, v51, 16
-; GCN-NEXT:    v_alignbit_b32 v49, v12, v49, 16
-; GCN-NEXT:    v_alignbit_b32 v39, v8, v39, 16
-; GCN-NEXT:    v_alignbit_b32 v25, v3, v25, 16
-; GCN-NEXT:    v_alignbit_b32 v23, v11, v23, 16
-; GCN-NEXT:    v_alignbit_b32 v26, v7, v26, 16
-; GCN-NEXT:    v_alignbit_b32 v24, v4, v24, 16
-; GCN-NEXT:    v_alignbit_b32 v21, v1, v21, 16
+; GCN-NEXT:    v_alignbit_b32 v49, v11, v49, 16
+; GCN-NEXT:    v_alignbit_b32 v39, v6, v39, 16
+; GCN-NEXT:    v_alignbit_b32 v24, v2, v24, 16
+; GCN-NEXT:    v_alignbit_b32 v21, v13, v21, 16
+; GCN-NEXT:    v_alignbit_b32 v26, v9, v26, 16
+; GCN-NEXT:    v_alignbit_b32 v25, v5, v25, 16
+; GCN-NEXT:    v_alignbit_b32 v23, v1, v23, 16
 ; GCN-NEXT:  .LBB29_2: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v54
@@ -16612,58 +16606,58 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_lshlrev_b32_e32 v49, 16, v49
 ; GCN-NEXT:    v_or_b32_e32 v38, v38, v49
 ; GCN-NEXT:    v_add_i32_e32 v49, vcc, 32, v0
-; GCN-NEXT:    v_and_b32_e32 v12, 0xffff, v12
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v12, v12, v14
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, 36, v0
+; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
+; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
+; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, 36, v0
 ; GCN-NEXT:    v_and_b32_e32 v37, 0xffff, v37
 ; GCN-NEXT:    v_lshlrev_b32_e32 v39, 16, v39
 ; GCN-NEXT:    v_or_b32_e32 v37, v37, v39
 ; GCN-NEXT:    v_add_i32_e32 v39, vcc, 40, v0
-; GCN-NEXT:    v_and_b32_e32 v8, 0xffff, v8
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v8
+; GCN-NEXT:    v_add_i32_e32 v8, vcc, 44, v0
+; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v24
+; GCN-NEXT:    v_add_i32_e32 v24, vcc, 48, v0
+; GCN-NEXT:    v_and_b32_e32 v2, 0xffff, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v2, v2, v4
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, 52, v0
+; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
+; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
+; GCN-NEXT:    v_or_b32_e32 v17, v17, v21
+; GCN-NEXT:    v_add_i32_e32 v21, vcc, 56, v0
+; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
+; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
+; GCN-NEXT:    v_add_i32_e32 v14, vcc, 60, v0
+; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
+; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
+; GCN-NEXT:    v_or_b32_e32 v22, v22, v26
+; GCN-NEXT:    v_add_i32_e32 v26, vcc, 64, v0
+; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
 ; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    v_or_b32_e32 v8, v8, v10
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, 44, v0
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, 0x44, v0
 ; GCN-NEXT:    v_and_b32_e32 v20, 0xffff, v20
 ; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
 ; GCN-NEXT:    v_or_b32_e32 v20, v20, v25
-; GCN-NEXT:    v_add_i32_e32 v25, vcc, 48, v0
-; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
-; GCN-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
-; GCN-NEXT:    v_or_b32_e32 v3, v3, v5
-; GCN-NEXT:    v_add_i32_e32 v5, vcc, 52, v0
+; GCN-NEXT:    v_add_i32_e32 v25, vcc, 0x48, v0
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v7
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, 0x4c, v0
 ; GCN-NEXT:    v_and_b32_e32 v18, 0xffff, v18
 ; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
 ; GCN-NEXT:    v_or_b32_e32 v18, v18, v23
-; GCN-NEXT:    v_add_i32_e32 v23, vcc, 56, v0
-; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v13
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 60, v0
-; GCN-NEXT:    v_and_b32_e32 v22, 0xffff, v22
-; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v22, v22, v26
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 64, v0
-; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
-; GCN-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v9
-; GCN-NEXT:    v_add_i32_e32 v9, vcc, 0x44, v0
-; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v24
-; GCN-NEXT:    v_add_i32_e32 v24, vcc, 0x48, v0
-; GCN-NEXT:    v_and_b32_e32 v4, 0xffff, v4
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    v_or_b32_e32 v4, v4, v6
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, 0x4c, v0
-; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v21, 16, v21
-; GCN-NEXT:    v_or_b32_e32 v17, v17, v21
-; GCN-NEXT:    v_add_i32_e32 v21, vcc, 0x50, v0
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 0x50, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v2
-; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x54, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
+; GCN-NEXT:    v_add_i32_e32 v3, vcc, 0x54, v0
 ; GCN-NEXT:    buffer_store_dword v27, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v28, v29, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v30, v35, s[0:3], 0 offen
@@ -16673,28 +16667,24 @@ define <44 x i16> @bitcast_v44f16_to_v44i16(<44 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v48, v51, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v15, v16, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v38, v49, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v12, v14, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v11, v12, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v37, v39, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v8, v10, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v20, v25, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v3, v5, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v18, v23, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v11, v13, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v22, v26, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v7, v9, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v6, v8, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v19, v24, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v4, v6, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v2, v4, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v17, v21, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v1, v2, s[0:3], 0 offen
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:60 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:64 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:68 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:72 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:76 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:80 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:84 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:88 ; 4-byte Folded Reload
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:92 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_store_dword v13, v14, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v22, v26, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v9, v10, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v20, v25, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v5, v7, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v18, v23, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v1, v3, s[0:3], 0 offen
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:60 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:64 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:68 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:72 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:76 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0)
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll
index b60649cc23590..1505075625f4a 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll
@@ -23883,34 +23883,31 @@ define <60 x i16> @bitcast_v60f16_to_v60i16(<60 x half> %a, i32 %b) {
 ; GCN-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:128 ; 4-byte Folded Spill
 ; GCN-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:124 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(2)
-; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:84
 ; GCN-NEXT:    s_waitcnt expcnt(1)
-; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:80
+; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:76
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:100
-; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:96
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:92
-; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:88
-; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:20
-; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:16
-; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:12
-; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:8
-; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:4
-; GCN-NEXT:    buffer_load_dword v47, off, s[0:3], s32
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:120
-; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:84
-; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:80
-; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:76
-; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:72
-; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:68
-; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:64
-; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:60
-; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:56
-; GCN-NEXT:    buffer_load_dword v57, off, s[0:3], s32 offset:52
-; GCN-NEXT:    buffer_load_dword v58, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:72
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:68
+; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:64
+; GCN-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:60
+; GCN-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:56
+; GCN-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:52
+; GCN-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:48
+; GCN-NEXT:    buffer_load_dword v42, off, s[0:3], s32 offset:44
+; GCN-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:40
+; GCN-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:36
+; GCN-NEXT:    buffer_load_dword v40, off, s[0:3], s32 offset:32
+; GCN-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:28
+; GCN-NEXT:    buffer_load_dword v41, off, s[0:3], s32 offset:24
+; GCN-NEXT:    buffer_load_dword v43, off, s[0:3], s32 offset:20
+; GCN-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:16
+; GCN-NEXT:    buffer_load_dword v56, off, s[0:3], s32 offset:12
+; GCN-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v2
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v3
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v4
 ; GCN-NEXT:    s_waitcnt expcnt(0)
@@ -23919,537 +23916,549 @@ define <60 x i16> @bitcast_v60f16_to_v60i16(<60 x half> %a, i32 %b) {
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v6
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v7
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v8
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v9
+; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v10
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v11
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v12
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v13
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:192 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v13
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v14
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v14
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v15
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v16
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v17
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v18
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v19
-; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v20
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v17
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:192 ; 4-byte Folded Spill
 ; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v18
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v19
+; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v20
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v21
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v22
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v23
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v22
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v23
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v24
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v25
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v25
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v26
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v27
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v28
-; GCN-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v29
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v30
-; GCN-NEXT:    s_waitcnt vmcnt(14)
-; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v50
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:44
-; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:40
-; GCN-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:36
-; GCN-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:32
-; GCN-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:28
-; GCN-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:24
-; GCN-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:116
-; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v55
-; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v41
-; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v53
-; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v51
-; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v13
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v10
-; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v6
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v58
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v57
-; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v56
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v43
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:4
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:120
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:104
+; GCN-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:108
+; GCN-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:112
+; GCN-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:116
+; GCN-NEXT:    s_waitcnt vmcnt(4)
+; GCN-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v4
+; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v3
+; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v46
+; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v56
+; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v44
+; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v43
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v41
+; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
+; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v45
+; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v40
+; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v55
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v53
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v42
-; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v40
-; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v54
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v52
-; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v49
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v48
-; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v39
-; GCN-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:112
-; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v63
-; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v62
-; GCN-NEXT:    v_cvt_f16_f32_e32 v62, v61
+; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v52
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v51
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v49
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v54
+; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v39
+; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v63
+; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v62
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v61
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:88
+; GCN-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:92
+; GCN-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:96
+; GCN-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:100
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v2
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v3
+; GCN-NEXT:    s_waitcnt vmcnt(1)
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v4
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v39
+; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v30
 ; GCN-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GCN-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GCN-NEXT:    s_or_saveexec_b64 s[4:5], s[4:5]
-; GCN-NEXT:    v_mov_b32_e32 v61, v27
-; GCN-NEXT:    v_mov_b32_e32 v63, v2
-; GCN-NEXT:    v_mov_b32_e32 v45, v1
-; GCN-NEXT:    v_mov_b32_e32 v56, v8
+; GCN-NEXT:    v_mov_b32_e32 v62, v24
+; GCN-NEXT:    v_mov_b32_e32 v63, v1
+; GCN-NEXT:    v_mov_b32_e32 v44, v47
+; GCN-NEXT:    v_mov_b32_e32 v61, v5
+; GCN-NEXT:    v_mov_b32_e32 v41, v60
 ; GCN-NEXT:    s_xor_b64 exec, exec, s[4:5]
 ; GCN-NEXT:    s_cbranch_execz .LBB29_2
 ; GCN-NEXT:  ; %bb.1: ; %cmp.true
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v52
-; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
-; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
-; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
-; GCN-NEXT:    v_cvt_f16_f32_e32 v52, v50
-; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v52
-; GCN-NEXT:    v_or_b32_e32 v48, v48, v50
+; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
+; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v2
+; GCN-NEXT:    v_or_b32_e32 v1, v23, v28
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:216 ; 4-byte Folded Spill
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v4, v4
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v54
+; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
 ; GCN-NEXT:    v_add_f32_e32 v4, 0x38000000, v4
-; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
+; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
-; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v4
-; GCN-NEXT:    v_or_b32_e32 v54, v50, v54
-; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v5
-; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v4
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
+; GCN-NEXT:    v_cvt_f32_f16_e32 v5, v52
+; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
 ; GCN-NEXT:    v_add_f32_e32 v5, 0x38000000, v5
-; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v6
-; GCN-NEXT:    v_or_b32_e32 v5, v5, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v7
+; GCN-NEXT:    v_or_b32_e32 v52, v5, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
+; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
 ; GCN-NEXT:    v_add_f32_e32 v9, 0x38000000, v9
-; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
+; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v9, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v10
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v11
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v13, v13
-; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
+; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
 ; GCN-NEXT:    v_add_f32_e32 v13, 0x38000000, v13
-; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
+; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v13, v13
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v15
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v17, v17
 ; GCN-NEXT:    v_add_f32_e32 v18, 0x38000000, v18
 ; GCN-NEXT:    v_add_f32_e32 v17, 0x38000000, v17
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v17, v17
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v18
-; GCN-NEXT:    v_or_b32_e32 v17, v17, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v18
+; GCN-NEXT:    v_or_b32_e32 v17, v17, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v20, v20
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v19, v19
 ; GCN-NEXT:    v_add_f32_e32 v20, 0x38000000, v20
 ; GCN-NEXT:    v_add_f32_e32 v19, 0x38000000, v19
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v20
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v19, v19
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v20
-; GCN-NEXT:    v_or_b32_e32 v19, v19, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v20
+; GCN-NEXT:    v_or_b32_e32 v19, v19, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v22, v22
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v21, v21
 ; GCN-NEXT:    v_add_f32_e32 v22, 0x38000000, v22
 ; GCN-NEXT:    v_add_f32_e32 v21, 0x38000000, v21
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v22, v22
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v21, v21
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v22
-; GCN-NEXT:    v_or_b32_e32 v21, v21, v50
-; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v7, v7
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v22
+; GCN-NEXT:    v_or_b32_e32 v21, v21, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v8, v8
+; GCN-NEXT:    v_cvt_f32_f16_e32 v6, v6
 ; GCN-NEXT:    v_add_f32_e32 v8, 0x38000000, v8
-; GCN-NEXT:    v_add_f32_e32 v7, 0x38000000, v7
-; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v8
-; GCN-NEXT:    v_cvt_f16_f32_e32 v7, v7
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v1
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v50
+; GCN-NEXT:    v_add_f32_e32 v6, 0x38000000, v6
+; GCN-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; GCN-NEXT:    v_cvt_f16_f32_e32 v6, v6
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v8
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GCN-NEXT:    v_cvt_f32_f16_e32 v11, v11
+; GCN-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GCN-NEXT:    v_add_f32_e32 v12, 0x38000000, v12
-; GCN-NEXT:    v_add_f32_e32 v11, 0x38000000, v11
+; GCN-NEXT:    v_add_f32_e32 v10, 0x38000000, v10
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v12, v12
-; GCN-NEXT:    v_cvt_f16_f32_e32 v11, v11
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v12
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v10, v10
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v12
+; GCN-NEXT:    v_or_b32_e32 v10, v10, v23
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v16, v16
-; GCN-NEXT:    v_cvt_f32_f16_e32 v15, v15
+; GCN-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; GCN-NEXT:    v_add_f32_e32 v16, 0x38000000, v16
-; GCN-NEXT:    v_add_f32_e32 v15, 0x38000000, v15
+; GCN-NEXT:    v_add_f32_e32 v14, 0x38000000, v14
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v16, v16
-; GCN-NEXT:    v_cvt_f16_f32_e32 v15, v15
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v15, v15, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v32, v32
-; GCN-NEXT:    v_cvt_f32_f16_e32 v31, v31
-; GCN-NEXT:    v_add_f32_e32 v32, 0x38000000, v32
-; GCN-NEXT:    v_add_f32_e32 v31, 0x38000000, v31
-; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v32
-; GCN-NEXT:    v_cvt_f16_f32_e32 v31, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v34, v34
-; GCN-NEXT:    v_cvt_f32_f16_e32 v33, v33
-; GCN-NEXT:    v_add_f32_e32 v34, 0x38000000, v34
-; GCN-NEXT:    v_add_f32_e32 v33, 0x38000000, v33
-; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v34
-; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v34
-; GCN-NEXT:    v_or_b32_e32 v33, v33, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v36, v36
-; GCN-NEXT:    v_cvt_f32_f16_e32 v35, v35
-; GCN-NEXT:    v_add_f32_e32 v36, 0x38000000, v36
-; GCN-NEXT:    v_add_f32_e32 v35, 0x38000000, v35
-; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v36
-; GCN-NEXT:    v_cvt_f16_f32_e32 v35, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v36
-; GCN-NEXT:    v_or_b32_e32 v35, v35, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v38, v38
-; GCN-NEXT:    v_cvt_f32_f16_e32 v37, v37
-; GCN-NEXT:    v_add_f32_e32 v38, 0x38000000, v38
-; GCN-NEXT:    v_add_f32_e32 v37, 0x38000000, v37
-; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v38
-; GCN-NEXT:    v_cvt_f16_f32_e32 v37, v37
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v38
-; GCN-NEXT:    v_or_b32_e32 v37, v37, v50
-; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v60
+; GCN-NEXT:    v_cvt_f16_f32_e32 v14, v14
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v16
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v32
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v31
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v32, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v32
+; GCN-NEXT:    v_or_b32_e32 v31, v23, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v34
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v33
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v34, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v33, v23, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v36
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v35
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v36, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v36
+; GCN-NEXT:    v_or_b32_e32 v35, v23, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v38
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v37
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v38, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v28
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v38
+; GCN-NEXT:    v_or_b32_e32 v37, v23, v28
+; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v57
 ; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v59
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v59
+; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v43, v2
-; GCN-NEXT:    v_cvt_f32_f16_e32 v3, v3
-; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v45
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v1, v1
+; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v58
+; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v41
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v2, v2
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v43, v5
+; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v44
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v27
-; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v56
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v46, v5
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v27
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; GCN-NEXT:    v_cvt_f32_f16_e32 v47, v24
+; GCN-NEXT:    v_cvt_f32_f16_e32 v57, v61
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_cvt_f32_f16_e32 v58, v27
-; GCN-NEXT:    v_cvt_f32_f16_e32 v59, v63
+; GCN-NEXT:    v_cvt_f32_f16_e32 v58, v24
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v26, v26
-; GCN-NEXT:    v_cvt_f32_f16_e32 v23, v23
+; GCN-NEXT:    v_cvt_f32_f16_e32 v59, v63
+; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v25
+; GCN-NEXT:    v_cvt_f32_f16_e32 v60, v62
+; GCN-NEXT:    v_cvt_f32_f16_e32 v56, v56
+; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v27
+; GCN-NEXT:    v_cvt_f32_f16_e32 v45, v45
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v25, v25
-; GCN-NEXT:    v_cvt_f32_f16_e32 v60, v61
-; GCN-NEXT:    v_cvt_f32_f16_e32 v47, v47
-; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v55
-; GCN-NEXT:    v_cvt_f32_f16_e32 v44, v44
-; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v51
-; GCN-NEXT:    v_cvt_f32_f16_e32 v41, v41
-; GCN-NEXT:    v_cvt_f32_f16_e32 v30, v30
-; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
-; GCN-NEXT:    v_cvt_f32_f16_e32 v28, v28
-; GCN-NEXT:    v_cvt_f32_f16_e32 v24, v24
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v42, v42
-; GCN-NEXT:    v_cvt_f32_f16_e32 v29, v29
+; GCN-NEXT:    v_cvt_f32_f16_e32 v40, v40
+; GCN-NEXT:    v_cvt_f32_f16_e32 v54, v54
+; GCN-NEXT:    v_cvt_f32_f16_e32 v53, v53
+; GCN-NEXT:    v_cvt_f32_f16_e32 v50, v50
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v49, v49
-; GCN-NEXT:    v_cvt_f32_f16_e32 v27, v62
+; GCN-NEXT:    v_cvt_f32_f16_e32 v55, v55
+; GCN-NEXT:    v_cvt_f32_f16_e32 v48, v48
+; GCN-NEXT:    v_cvt_f32_f16_e32 v51, v51
 ; GCN-NEXT:    v_cvt_f32_f16_e32 v39, v39
-; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
+; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
+; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
 ; GCN-NEXT:    v_add_f32_e32 v1, 0x38000000, v1
-; GCN-NEXT:    v_add_f32_e32 v40, 0x38000000, v40
+; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
+; GCN-NEXT:    v_add_f32_e32 v41, 0x38000000, v41
 ; GCN-NEXT:    v_add_f32_e32 v43, 0x38000000, v43
-; GCN-NEXT:    v_add_f32_e32 v3, 0x38000000, v3
-; GCN-NEXT:    v_add_f32_e32 v45, 0x38000000, v45
-; GCN-NEXT:    v_add_f32_e32 v2, 0x38000000, v2
+; GCN-NEXT:    v_add_f32_e32 v44, 0x38000000, v44
 ; GCN-NEXT:    v_add_f32_e32 v46, 0x38000000, v46
-; GCN-NEXT:    v_add_f32_e32 v56, 0x38000000, v56
+; GCN-NEXT:    v_add_f32_e32 v47, 0x38000000, v47
 ; GCN-NEXT:    v_add_f32_e32 v57, 0x38000000, v57
 ; GCN-NEXT:    v_add_f32_e32 v58, 0x38000000, v58
-; GCN-NEXT:    v_add_f32_e32 v59, 0x38000000, v59
 ; GCN-NEXT:    v_add_f32_e32 v26, 0x38000000, v26
-; GCN-NEXT:    v_add_f32_e32 v23, 0x38000000, v23
-; GCN-NEXT:    v_add_f32_e32 v25, 0x38000000, v25
-; GCN-NEXT:    v_add_f32_e32 v60, 0x38000000, v60
-; GCN-NEXT:    v_add_f32_e32 v47, 0x38000000, v47
-; GCN-NEXT:    v_add_f32_e32 v55, 0x38000000, v55
-; GCN-NEXT:    v_add_f32_e32 v44, 0x38000000, v44
-; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
-; GCN-NEXT:    v_add_f32_e32 v41, 0x38000000, v41
-; GCN-NEXT:    v_add_f32_e32 v30, 0x38000000, v30
-; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
-; GCN-NEXT:    v_add_f32_e32 v28, 0x38000000, v28
+; GCN-NEXT:    v_add_f32_e32 v59, 0x38000000, v59
 ; GCN-NEXT:    v_add_f32_e32 v24, 0x38000000, v24
+; GCN-NEXT:    v_add_f32_e32 v60, 0x38000000, v60
+; GCN-NEXT:    v_add_f32_e32 v56, 0x38000000, v56
+; GCN-NEXT:    v_add_f32_e32 v27, 0x38000000, v27
+; GCN-NEXT:    v_add_f32_e32 v45, 0x38000000, v45
+; GCN-NEXT:    v_add_f32_e32 v25, 0x38000000, v25
 ; GCN-NEXT:    v_add_f32_e32 v42, 0x38000000, v42
-; GCN-NEXT:    v_add_f32_e32 v29, 0x38000000, v29
+; GCN-NEXT:    v_add_f32_e32 v40, 0x38000000, v40
+; GCN-NEXT:    v_add_f32_e32 v54, 0x38000000, v54
+; GCN-NEXT:    v_add_f32_e32 v53, 0x38000000, v53
+; GCN-NEXT:    v_add_f32_e32 v50, 0x38000000, v50
 ; GCN-NEXT:    v_add_f32_e32 v49, 0x38000000, v49
-; GCN-NEXT:    v_add_f32_e32 v27, 0x38000000, v27
+; GCN-NEXT:    v_add_f32_e32 v55, 0x38000000, v55
+; GCN-NEXT:    v_add_f32_e32 v48, 0x38000000, v48
+; GCN-NEXT:    v_add_f32_e32 v51, 0x38000000, v51
 ; GCN-NEXT:    v_add_f32_e32 v39, 0x38000000, v39
-; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
+; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
+; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v40
+; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
+; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v41
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v43, v43
-; GCN-NEXT:    v_cvt_f16_f32_e32 v3, v3
-; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v45
-; GCN-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v44
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v46, v46
-; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v56
+; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v47
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v57
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v58
-; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v59
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v26, v26
-; GCN-NEXT:    v_cvt_f16_f32_e32 v23, v23
-; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v25
-; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
-; GCN-NEXT:    v_cvt_f16_f32_e32 v47, v47
-; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v55
-; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v44
-; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
-; GCN-NEXT:    v_cvt_f16_f32_e32 v41, v41
-; GCN-NEXT:    v_cvt_f16_f32_e32 v30, v30
-; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v53
-; GCN-NEXT:    v_cvt_f16_f32_e32 v28, v28
+; GCN-NEXT:    v_cvt_f16_f32_e32 v59, v59
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v24, v24
+; GCN-NEXT:    v_cvt_f16_f32_e32 v60, v60
+; GCN-NEXT:    v_cvt_f16_f32_e32 v56, v56
+; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v45
+; GCN-NEXT:    v_cvt_f16_f32_e32 v25, v25
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v42, v42
-; GCN-NEXT:    v_cvt_f16_f32_e32 v29, v29
+; GCN-NEXT:    v_cvt_f16_f32_e32 v40, v40
+; GCN-NEXT:    v_cvt_f16_f32_e32 v54, v54
+; GCN-NEXT:    v_cvt_f16_f32_e32 v53, v53
+; GCN-NEXT:    v_cvt_f16_f32_e32 v50, v50
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v49, v49
-; GCN-NEXT:    v_cvt_f16_f32_e32 v27, v27
+; GCN-NEXT:    v_cvt_f16_f32_e32 v55, v55
+; GCN-NEXT:    v_cvt_f16_f32_e32 v48, v48
+; GCN-NEXT:    v_cvt_f16_f32_e32 v51, v51
 ; GCN-NEXT:    v_cvt_f16_f32_e32 v39, v39
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v50
-; GCN-NEXT:    v_lshlrev_b32_e32 v40, 16, v40
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v56
-; GCN-NEXT:    v_lshlrev_b32_e32 v58, 16, v58
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v23
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
+; GCN-NEXT:    v_lshlrev_b32_e32 v30, 16, v30
+; GCN-NEXT:    v_lshlrev_b32_e32 v43, 16, v43
+; GCN-NEXT:    v_lshlrev_b32_e32 v46, 16, v46
+; GCN-NEXT:    v_lshlrev_b32_e32 v61, 16, v57
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v41
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
 ; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v50
-; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v1, v43, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v56
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v45
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v54
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v51
+; GCN-NEXT:    v_or_b32_e32 v5, v28, v23
+; GCN-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:212 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v29
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v45, v45, v3
-; GCN-NEXT:    v_or_b32_e32 v43, v46, v2
-; GCN-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:192 ; 4-byte Folded Spill
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    v_or_b32_e32 v43, v57, v56
-; GCN-NEXT:    buffer_store_dword v43, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
-; GCN-NEXT:    v_or_b32_e32 v63, v59, v58
-; GCN-NEXT:    v_or_b32_e32 v23, v23, v26
-; GCN-NEXT:    v_or_b32_e32 v61, v60, v25
-; GCN-NEXT:    v_or_b32_e32 v55, v55, v47
-; GCN-NEXT:    v_or_b32_e32 v51, v51, v44
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v41
-; GCN-NEXT:    v_or_b32_e32 v28, v28, v53
-; GCN-NEXT:    v_or_b32_e32 v42, v42, v24
-; GCN-NEXT:    v_or_b32_e32 v49, v49, v29
-; GCN-NEXT:    v_or_b32_e32 v39, v39, v27
-; GCN-NEXT:    v_alignbit_b32 v60, v37, v50, 16
-; GCN-NEXT:    v_alignbit_b32 v59, v35, v40, 16
-; GCN-NEXT:    v_alignbit_b32 v3, v33, v3, 16
-; GCN-NEXT:    v_alignbit_b32 v1, v31, v2, 16
+; GCN-NEXT:    v_or_b32_e32 v41, v41, v30
+; GCN-NEXT:    v_or_b32_e32 v44, v44, v43
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v1, v47, v46
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:192 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v28, v58, v61
+; GCN-NEXT:    buffer_store_dword v28, off, s[0:3], s32 offset:188 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v63, v59, v26
+; GCN-NEXT:    v_or_b32_e32 v62, v60, v24
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v56
+; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_or_b32_e32 v1, v25, v45
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
+; GCN-NEXT:    v_or_b32_e32 v40, v40, v42
+; GCN-NEXT:    v_or_b32_e32 v53, v53, v54
+; GCN-NEXT:    v_or_b32_e32 v49, v49, v50
+; GCN-NEXT:    v_or_b32_e32 v48, v48, v55
+; GCN-NEXT:    v_or_b32_e32 v39, v39, v51
+; GCN-NEXT:    v_alignbit_b32 v57, v37, v23, 16
+; GCN-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:216 ; 4-byte Folded Reload
+; GCN-NEXT:    v_alignbit_b32 v59, v35, v29, 16
+; GCN-NEXT:    v_alignbit_b32 v58, v33, v30, 16
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v31, v43, 16
 ; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:208 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v56, v15, v56, 16
-; GCN-NEXT:    v_alignbit_b32 v2, v11, v58, 16
-; GCN-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:196 ; 4-byte Folded Spill
-; GCN-NEXT:    v_alignbit_b32 v26, v7, v26, 16
-; GCN-NEXT:    v_alignbit_b32 v25, v21, v25, 16
-; GCN-NEXT:    v_alignbit_b32 v47, v19, v47, 16
-; GCN-NEXT:    v_alignbit_b32 v44, v17, v44, 16
-; GCN-NEXT:    v_alignbit_b32 v41, v13, v41, 16
-; GCN-NEXT:    v_alignbit_b32 v53, v9, v53, 16
-; GCN-NEXT:    v_alignbit_b32 v24, v5, v24, 16
-; GCN-NEXT:    v_alignbit_b32 v29, v54, v29, 16
-; GCN-NEXT:    v_alignbit_b32 v62, v48, v27, 16
+; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    v_alignbit_b32 v1, v14, v46, 16
+; GCN-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:200 ; 4-byte Folded Spill
+; GCN-NEXT:    v_alignbit_b32 v61, v10, v61, 16
+; GCN-NEXT:    v_alignbit_b32 v26, v6, v26, 16
+; GCN-NEXT:    v_alignbit_b32 v25, v21, v24, 16
+; GCN-NEXT:    v_alignbit_b32 v56, v19, v56, 16
+; GCN-NEXT:    v_alignbit_b32 v45, v17, v45, 16
+; GCN-NEXT:    v_alignbit_b32 v42, v13, v42, 16
+; GCN-NEXT:    v_alignbit_b32 v54, v9, v54, 16
+; GCN-NEXT:    v_alignbit_b32 v50, v52, v50, 16
+; GCN-NEXT:    v_alignbit_b32 v55, v3, v55, 16
+; GCN-NEXT:    s_waitcnt vmcnt(2)
+; GCN-NEXT:    v_alignbit_b32 v51, v23, v51, 16
 ; GCN-NEXT:  .LBB29_2: ; %end
 ; GCN-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT:    s_waitcnt expcnt(1)
+; GCN-NEXT:    v_mov_b32_e32 v60, v2
+; GCN-NEXT:    v_mov_b32_e32 v2, v23
+; GCN-NEXT:    s_waitcnt expcnt(0)
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:212 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v60
-; GCN-NEXT:    v_or_b32_e32 v57, v1, v50
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v37
-; GCN-NEXT:    v_lshlrev_b32_e32 v37, 16, v38
-; GCN-NEXT:    v_or_b32_e32 v37, v1, v37
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v1
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v57
+; GCN-NEXT:    v_or_b32_e32 v57, v23, v28
+; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v37
+; GCN-NEXT:    v_lshlrev_b32_e32 v28, 16, v38
+; GCN-NEXT:    v_or_b32_e32 v37, v23, v28
 ; GCN-NEXT:    v_add_i32_e32 v38, vcc, 4, v0
 ; GCN-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:204 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v59
-; GCN-NEXT:    v_or_b32_e32 v46, v1, v50
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v59
+; GCN-NEXT:    v_or_b32_e32 v46, v1, v23
 ; GCN-NEXT:    v_add_i32_e32 v59, vcc, 8, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v35
-; GCN-NEXT:    v_lshlrev_b32_e32 v35, 16, v36
-; GCN-NEXT:    v_or_b32_e32 v35, v1, v35
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v36
+; GCN-NEXT:    v_or_b32_e32 v35, v1, v23
 ; GCN-NEXT:    v_add_i32_e32 v36, vcc, 12, v0
-; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v45
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GCN-NEXT:    v_or_b32_e32 v43, v1, v3
+; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v41
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v58
+; GCN-NEXT:    v_or_b32_e32 v43, v1, v23
 ; GCN-NEXT:    v_add_i32_e32 v58, vcc, 16, v0
 ; GCN-NEXT:    v_and_b32_e32 v1, 0xffff, v33
-; GCN-NEXT:    v_lshlrev_b32_e32 v3, 16, v34
-; GCN-NEXT:    v_or_b32_e32 v1, v1, v3
-; GCN-NEXT:    v_add_i32_e32 v3, vcc, 20, v0
-; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
+; GCN-NEXT:    v_lshlrev_b32_e32 v23, 16, v34
+; GCN-NEXT:    v_or_b32_e32 v1, v1, v23
+; GCN-NEXT:    v_add_i32_e32 v23, vcc, 20, v0
+; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v44
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v5
+; GCN-NEXT:    v_or_b32_e32 v28, v28, v29
+; GCN-NEXT:    v_add_i32_e32 v29, vcc, 24, v0
+; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v31
+; GCN-NEXT:    v_lshlrev_b32_e32 v31, 16, v32
+; GCN-NEXT:    v_or_b32_e32 v30, v30, v31
+; GCN-NEXT:    v_add_i32_e32 v31, vcc, 28, v0
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:192 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v33, 0xffff, v2
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:208 ; 4-byte Folded Reload
+; GCN-NEXT:    v_and_b32_e32 v32, 0xffff, v24
+; GCN-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GCN-NEXT:    v_or_b32_e32 v2, v33, v2
-; GCN-NEXT:    v_add_i32_e32 v33, vcc, 24, v0
-; GCN-NEXT:    v_and_b32_e32 v31, 0xffff, v31
-; GCN-NEXT:    v_lshlrev_b32_e32 v32, 16, v32
-; GCN-NEXT:    v_or_b32_e32 v31, v31, v32
-; GCN-NEXT:    v_add_i32_e32 v32, vcc, 28, v0
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_and_b32_e32 v34, 0xffff, v27
-; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v56
-; GCN-NEXT:    v_or_b32_e32 v34, v34, v50
-; GCN-NEXT:    v_add_i32_e32 v50, vcc, 32, v0
-; GCN-NEXT:    v_and_b32_e32 v15, 0xffff, v15
+; GCN-NEXT:    v_lshlrev_b32_e32 v33, 16, v5
+; GCN-NEXT:    v_or_b32_e32 v32, v32, v33
+; GCN-NEXT:    v_add_i32_e32 v33, vcc, 32, v0
+; GCN-NEXT:    v_and_b32_e32 v14, 0xffff, v14
 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 16, v16
-; GCN-NEXT:    v_or_b32_e32 v15, v15, v16
+; GCN-NEXT:    v_or_b32_e32 v14, v14, v16
 ; GCN-NEXT:    v_add_i32_e32 v16, vcc, 36, v0
-; GCN-NEXT:    v_and_b32_e32 v40, 0xffff, v63
-; GCN-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; GCN-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:188 ; 4-byte Folded Reload
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v27
-; GCN-NEXT:    v_or_b32_e32 v40, v40, v45
-; GCN-NEXT:    v_add_i32_e32 v45, vcc, 40, v0
-; GCN-NEXT:    v_and_b32_e32 v11, 0xffff, v11
+; GCN-NEXT:    v_and_b32_e32 v34, 0xffff, v24
+; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v61
+; GCN-NEXT:    v_or_b32_e32 v34, v34, v41
+; GCN-NEXT:    v_add_i32_e32 v41, vcc, 40, v0
+; GCN-NEXT:    v_and_b32_e32 v10, 0xffff, v10
 ; GCN-NEXT:    v_lshlrev_b32_e32 v12, 16, v12
-; GCN-NEXT:    v_or_b32_e32 v11, v11, v12
+; GCN-NEXT:    v_or_b32_e32 v10, v10, v12
 ; GCN-NEXT:    v_add_i32_e32 v12, vcc, 44, v0
-; GCN-NEXT:    v_and_b32_e32 v23, 0xffff, v23
+; GCN-NEXT:    v_and_b32_e32 v44, 0xffff, v63
 ; GCN-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
-; GCN-NEXT:    v_or_b32_e32 v23, v23, v26
-; GCN-NEXT:    v_add_i32_e32 v26, vcc, 48, v0
-; GCN-NEXT:    v_and_b32_e32 v7, 0xffff, v7
-; GCN-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:200 ; 4-byte Folded Reload
-; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_or_b32_e32 v26, v44, v26
+; GCN-NEXT:    v_add_i32_e32 v44, vcc, 48, v0
+; GCN-NEXT:    v_and_b32_e32 v6, 0xffff, v6
 ; GCN-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
-; GCN-NEXT:    v_or_b32_e32 v7, v7, v8
+; GCN-NEXT:    v_or_b32_e32 v6, v6, v8
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 52, v0
-; GCN-NEXT:    v_and_b32_e32 v56, 0xffff, v61
-; GCN-NEXT:    v_lshlrev_b32_e32 v25, 16, v25
-; GCN-NEXT:    v_or_b32_e32 v25, v56, v25
-; GCN-NEXT:    v_add_i32_e32 v56, vcc, 56, v0
+; GCN-NEXT:    v_and_b32_e32 v47, 0xffff, v62
+; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v25
+; GCN-NEXT:    v_or_b32_e32 v24, v47, v24
+; GCN-NEXT:    v_add_i32_e32 v47, vcc, 56, v0
 ; GCN-NEXT:    v_and_b32_e32 v21, 0xffff, v21
 ; GCN-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
 ; GCN-NEXT:    v_or_b32_e32 v21, v21, v22
 ; GCN-NEXT:    v_add_i32_e32 v22, vcc, 60, v0
-; GCN-NEXT:    v_and_b32_e32 v55, 0xffff, v55
-; GCN-NEXT:    v_lshlrev_b32_e32 v47, 16, v47
-; GCN-NEXT:    v_or_b32_e32 v55, v55, v47
-; GCN-NEXT:    v_add_i32_e32 v47, vcc, 64, v0
+; GCN-NEXT:    v_and_b32_e32 v27, 0xffff, v27
+; GCN-NEXT:    v_lshlrev_b32_e32 v56, 16, v56
+; GCN-NEXT:    v_or_b32_e32 v27, v27, v56
+; GCN-NEXT:    v_add_i32_e32 v56, vcc, 64, v0
 ; GCN-NEXT:    v_and_b32_e32 v19, 0xffff, v19
 ; GCN-NEXT:    v_lshlrev_b32_e32 v20, 16, v20
 ; GCN-NEXT:    v_or_b32_e32 v19, v19, v20
 ; GCN-NEXT:    v_add_i32_e32 v20, vcc, 0x44, v0
-; GCN-NEXT:    v_and_b32_e32 v51, 0xffff, v51
-; GCN-NEXT:    v_lshlrev_b32_e32 v44, 16, v44
-; GCN-NEXT:    v_or_b32_e32 v51, v51, v44
-; GCN-NEXT:    v_add_i32_e32 v44, vcc, 0x48, v0
+; GCN-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:196 ; 4-byte Folded Reload
+; GCN-NEXT:    s_waitcnt vmcnt(0)
+; GCN-NEXT:    v_and_b32_e32 v25, 0xffff, v25
+; GCN-NEXT:    v_lshlrev_b32_e32 v45, 16, v45
+; GCN-NEXT:    v_or_b32_e32 v25, v25, v45
+; GCN-NEXT:    v_add_i32_e32 v45, vcc, 0x48, v0
 ; GCN-NEXT:    v_and_b32_e32 v17, 0xffff, v17
 ; GCN-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GCN-NEXT:    v_or_b32_e32 v17, v17, v18
 ; GCN-NEXT:    v_add_i32_e32 v18, vcc, 0x4c, v0
-; GCN-NEXT:    v_and_b32_e32 v30, 0xffff, v30
-; GCN-NEXT:    v_lshlrev_b32_e32 v41, 16, v41
-; GCN-NEXT:    v_or_b32_e32 v30, v30, v41
-; GCN-NEXT:    v_add_i32_e32 v41, vcc, 0x50, v0
+; GCN-NEXT:    v_and_b32_e32 v40, 0xffff, v40
+; GCN-NEXT:    v_lshlrev_b32_e32 v42, 16, v42
+; GCN-NEXT:    v_or_b32_e32 v40, v40, v42
+; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x50, v0
 ; GCN-NEXT:    v_and_b32_e32 v13, 0xffff, v13
-; GCN-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
-; GCN-NEXT:    v_or_b32_e32 v13, v13, v14
-; GCN-NEXT:    v_add_i32_e32 v14, vcc, 0x54, v0
-; GCN-NEXT:    v_and_b32_e32 v28, 0xffff, v28
-; GCN-NEXT:    v_lshlrev_b32_e32 v53, 16, v53
-; GCN-NEXT:    v_or_b32_e32 v28, v28, v53
-; GCN-NEXT:    v_add_i32_e32 v53, vcc, 0x58, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v15, 16, v15
+; GCN-NEXT:    v_or_b32_e32 v13, v13, v15
+; GCN-NEXT:    v_add_i32_e32 v15, vcc, 0x54, v0
+; GCN-NEXT:    v_and_b32_e32 v53, 0xffff, v53
+; GCN-NEXT:    v_lshlrev_b32_e32 v54, 16, v54
+; GCN-NEXT:    v_or_b32_e32 v53, v53, v54
+; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x58, v0
 ; GCN-NEXT:    v_and_b32_e32 v9, 0xffff, v9
-; GCN-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
-; GCN-NEXT:    v_or_b32_e32 v9, v9, v10
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, 0x5c, v0
-; GCN-NEXT:    v_and_b32_e32 v42, 0xffff, v42
-; GCN-NEXT:    v_lshlrev_b32_e32 v24, 16, v24
-; GCN-NEXT:    v_or_b32_e32 v24, v42, v24
-; GCN-NEXT:    v_add_i32_e32 v42, vcc, 0x60, v0
-; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v5
-; GCN-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GCN-NEXT:    v_or_b32_e32 v5, v5, v6
-; GCN-NEXT:    v_add_i32_e32 v6, vcc, 0x64, v0
+; GCN-NEXT:    v_lshlrev_b32_e32 v11, 16, v11
+; GCN-NEXT:    v_or_b32_e32 v9, v9, v11
+; GCN-NEXT:    v_add_i32_e32 v11, vcc, 0x5c, v0
 ; GCN-NEXT:    v_and_b32_e32 v49, 0xffff, v49
-; GCN-NEXT:    v_lshlrev_b32_e32 v29, 16, v29
-; GCN-NEXT:    v_or_b32_e32 v29, v49, v29
-; GCN-NEXT:    v_add_i32_e32 v49, vcc, 0x68, v0
-; GCN-NEXT:    v_and_b32_e32 v54, 0xffff, v54
+; GCN-NEXT:    v_lshlrev_b32_e32 v50, 16, v50
+; GCN-NEXT:    v_or_b32_e32 v49, v49, v50
+; GCN-NEXT:    v_add_i32_e32 v50, vcc, 0x60, v0
+; GCN-NEXT:    v_and_b32_e32 v5, 0xffff, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
+; GCN-NEXT:    v_or_b32_e32 v5, v5, v7
+; GCN-NEXT:    v_add_i32_e32 v7, vcc, 0x64, v0
+; GCN-NEXT:    v_and_b32_e32 v48, 0xffff, v48
+; GCN-NEXT:    v_lshlrev_b32_e32 v55, 16, v55
+; GCN-NEXT:    v_or_b32_e32 v48, v48, v55
+; GCN-NEXT:    v_add_i32_e32 v55, vcc, 0x68, v0
+; GCN-NEXT:    v_and_b32_e32 v3, 0xffff, v3
 ; GCN-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GCN-NEXT:    v_or_b32_e32 v4, v54, v4
-; GCN-NEXT:    v_add_i32_e32 v54, vcc, 0x6c, v0
+; GCN-NEXT:    v_or_b32_e32 v3, v3, v4
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, 0x6c, v0
 ; GCN-NEXT:    v_and_b32_e32 v39, 0xffff, v39
-; GCN-NEXT:    v_lshlrev_b32_e32 v27, 16, v62
-; GCN-NEXT:    v_or_b32_e32 v27, v39, v27
-; GCN-NEXT:    v_add_i32_e32 v39, vcc, 0x70, v0
-; GCN-NEXT:    v_and_b32_e32 v48, 0xffff, v48
-; GCN-NEXT:    v_lshlrev_b32_e32 v52, 16, v52
-; GCN-NEXT:    v_or_b32_e32 v48, v48, v52
+; GCN-NEXT:    v_lshlrev_b32_e32 v51, 16, v51
+; GCN-NEXT:    v_or_b32_e32 v39, v39, v51
+; GCN-NEXT:    v_add_i32_e32 v51, vcc, 0x70, v0
+; GCN-NEXT:    v_and_b32_e32 v52, 0xffff, v2
+; GCN-NEXT:    v_lshlrev_b32_e32 v2, 16, v60
+; GCN-NEXT:    v_or_b32_e32 v2, v52, v2
 ; GCN-NEXT:    v_add_i32_e32 v52, vcc, 0x74, v0
 ; GCN-NEXT:    buffer_store_dword v57, v0, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v37, v38, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v46, v59, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v35, v36, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v43, v58, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v1, v3, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v2, v33, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v31, v32, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v34, v50, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v15, v16, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v40, v45, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v11, v12, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v23, v26, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v7, v8, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v25, v56, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v1, v23, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v28, v29, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v30, v31, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v32, v33, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v14, v16, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v34, v41, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v10, v12, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v26, v44, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v6, v8, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v24, v47, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v21, v22, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v55, v47, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v27, v56, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v19, v20, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v51, v44, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v25, v45, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_store_dword v17, v18, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v30, v41, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v13, v14, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v28, v53, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v9, v10, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v24, v42, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v5, v6, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v29, v49, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v4, v54, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v27, v39, s[0:3], 0 offen
-; GCN-NEXT:    buffer_store_dword v48, v52, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v40, v42, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v13, v15, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v53, v54, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v9, v11, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v49, v50, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v5, v7, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v48, v55, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v3, v4, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v39, v51, s[0:3], 0 offen
+; GCN-NEXT:    buffer_store_dword v2, v52, s[0:3], 0 offen
 ; GCN-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:124 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v62, off, s[0:3], s32 offset:128 ; 4-byte Folded Reload
 ; GCN-NEXT:    buffer_load_dword v61, off, s[0:3], s32 offset:132 ; 4-byte Folded Reload
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll
index 8ca3e8255b634..6e8a5a1266a15 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll
@@ -937,8 +937,8 @@ define amdgpu_cs_chain void @amdgpu_cs_chain_dont_realign_stack(i32 %idx) {
 ; GISEL-GFX11-NEXT:    v_lshlrev_b32_e32 v0, 4, v8
 ; GISEL-GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GISEL-GFX11-NEXT:    v_mov_b32_e32 v4, v0
-; GISEL-GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v3, s3
-; GISEL-GFX11-NEXT:    v_dual_mov_b32 v1, s1 :: v_dual_mov_b32 v2, s2
+; GISEL-GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GISEL-GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
 ; GISEL-GFX11-NEXT:    scratch_store_b128 v4, v[0:3], off dlc
 ; GISEL-GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GISEL-GFX11-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll
index 4ba9f0729ea1f..2d4f7485c6576 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll
@@ -590,8 +590,8 @@ define amdgpu_cs_chain_preserve void @amdgpu_cs_chain_preserve_dont_realign_stac
 ; GISEL-GFX11-NEXT:    v_lshlrev_b32_e32 v0, 4, v8
 ; GISEL-GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GISEL-GFX11-NEXT:    v_mov_b32_e32 v4, v0
-; GISEL-GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v3, s3
-; GISEL-GFX11-NEXT:    v_dual_mov_b32 v1, s1 :: v_dual_mov_b32 v2, s2
+; GISEL-GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GISEL-GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
 ; GISEL-GFX11-NEXT:    scratch_store_b128 v4, v[0:3], off dlc
 ; GISEL-GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GISEL-GFX11-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
index 9775a37276dfd..2d3a941b8a516 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
@@ -12122,11 +12122,11 @@ define amdgpu_kernel void @max_i64_varying(ptr addrspace(1) %out) {
 ; GFX1132_DPP-NEXT:    v_bfrev_b32_e32 v5, 1
 ; GFX1132_DPP-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX1132_DPP-NEXT:    v_cmp_gt_i64_e32 vcc_lo, v[1:2], v[3:4]
-; GFX1132_DPP-NEXT:    v_cndmask_b32_e32 v2, v4, v2, vcc_lo
-; GFX1132_DPP-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_cndmask_b32 v1, v3, v1
+; GFX1132_DPP-NEXT:    v_dual_cndmask_b32 v2, v4, v2 :: v_dual_cndmask_b32 v1, v3, v1
+; GFX1132_DPP-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX1132_DPP-NEXT:    v_readlane_b32 s3, v2, 15
 ; GFX1132_DPP-NEXT:    v_readlane_b32 s1, v2, 31
-; GFX1132_DPP-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
+; GFX1132_DPP-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX1132_DPP-NEXT:    v_readlane_b32 s0, v1, 31
 ; GFX1132_DPP-NEXT:    v_mov_b32_dpp v5, v2 row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX1132_DPP-NEXT:    v_mov_b32_dpp v4, v1 row_shr:1 row_mask:0xf bank_mask:0xf
@@ -13950,11 +13950,11 @@ define amdgpu_kernel void @min_i64_varying(ptr addrspace(1) %out) {
 ; GFX1132_DPP-NEXT:    v_bfrev_b32_e32 v5, -2
 ; GFX1132_DPP-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_2)
 ; GFX1132_DPP-NEXT:    v_cmp_lt_i64_e32 vcc_lo, v[1:2], v[3:4]
-; GFX1132_DPP-NEXT:    v_cndmask_b32_e32 v2, v4, v2, vcc_lo
-; GFX1132_DPP-NEXT:    v_dual_mov_b32 v4, -1 :: v_dual_cndmask_b32 v1, v3, v1
+; GFX1132_DPP-NEXT:    v_dual_cndmask_b32 v2, v4, v2 :: v_dual_cndmask_b32 v1, v3, v1
+; GFX1132_DPP-NEXT:    v_mov_b32_e32 v4, -1
 ; GFX1132_DPP-NEXT:    v_readlane_b32 s3, v2, 15
 ; GFX1132_DPP-NEXT:    v_readlane_b32 s1, v2, 31
-; GFX1132_DPP-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
+; GFX1132_DPP-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX1132_DPP-NEXT:    v_readlane_b32 s0, v1, 31
 ; GFX1132_DPP-NEXT:    v_mov_b32_dpp v5, v2 row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX1132_DPP-NEXT:    v_mov_b32_dpp v4, v1 row_shr:1 row_mask:0xf bank_mask:0xf
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index 19b6ff68b9869..11a9e18936485 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -1428,47 +1428,48 @@ define void @v_store_global_v32bf16(<32 x bfloat> %val, ptr addrspace(1) %ptr) {
 ; GFX7-LABEL: v_store_global_v32bf16:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v3, 1.0, v3
-; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v1
-; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GFX7-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
-; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
-; GFX7-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GFX7-NEXT:    v_alignbit_b32 v3, v3, v2, 16
-; GFX7-NEXT:    v_alignbit_b32 v2, v1, v0, 16
-; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v14
-; GFX7-NEXT:    buffer_load_dword v14, off, s[0:3], s32
 ; GFX7-NEXT:    v_mul_f32_e32 v25, 1.0, v25
 ; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v7
-; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v15
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v25, 16, v25
 ; GFX7-NEXT:    v_mul_f32_e32 v24, 1.0, v24
 ; GFX7-NEXT:    v_mul_f32_e32 v6, 1.0, v6
 ; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v5
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GFX7-NEXT:    v_lshrrev_b32_e32 v0, 16, v0
 ; GFX7-NEXT:    v_alignbit_b32 v25, v25, v24, 16
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v24, 16, v5
 ; GFX7-NEXT:    v_alignbit_b32 v5, v7, v6, 16
-; GFX7-NEXT:    v_mul_f32_e32 v6, 1.0, v13
+; GFX7-NEXT:    buffer_load_dword v6, off, s[0:3], s32
+; GFX7-NEXT:    v_mul_f32_e32 v3, 1.0, v3
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GFX7-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GFX7-NEXT:    v_alignbit_b32 v3, v3, v2, 16
+; GFX7-NEXT:    v_alignbit_b32 v2, v1, v0, 16
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v15
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v14
+; GFX7-NEXT:    v_lshrrev_b32_e32 v0, 16, v0
+; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v13
 ; GFX7-NEXT:    v_alignbit_b32 v13, v0, v1, 16
 ; GFX7-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
 ; GFX7-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v12
-; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
-; GFX7-NEXT:    v_alignbit_b32 v12, v6, v7, 16
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v11
-; GFX7-NEXT:    v_mul_f32_e32 v10, 1.0, v10
-; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GFX7-NEXT:    v_mul_f32_e32 v29, 1.0, v29
-; GFX7-NEXT:    v_alignbit_b32 v11, v7, v10, 16
+; GFX7-NEXT:    v_mul_f32_e32 v12, 1.0, v12
+; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v29, 16, v29
 ; GFX7-NEXT:    v_mul_f32_e32 v28, 1.0, v28
 ; GFX7-NEXT:    v_mul_f32_e32 v27, 1.0, v27
-; GFX7-NEXT:    v_mul_f32_e32 v6, 1.0, v30
+; GFX7-NEXT:    v_alignbit_b32 v12, v7, v12, 16
+; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v30
+; GFX7-NEXT:    v_mul_f32_e32 v11, 1.0, v11
 ; GFX7-NEXT:    v_mul_f32_e32 v9, 1.0, v9
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v31, 16, v27
 ; GFX7-NEXT:    v_alignbit_b32 v27, v29, v28, 16
+; GFX7-NEXT:    v_mul_f32_e32 v10, 1.0, v10
+; GFX7-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
+; GFX7-NEXT:    v_alignbit_b32 v11, v11, v10, 16
+; GFX7-NEXT:    v_mul_f32_e32 v14, 1.0, v20
 ; GFX7-NEXT:    v_mul_f32_e32 v26, 1.0, v26
 ; GFX7-NEXT:    s_mov_b32 s6, 0
 ; GFX7-NEXT:    v_alignbit_b32 v26, v31, v26, 16
@@ -1478,9 +1479,9 @@ define void @v_store_global_v32bf16(<32 x bfloat> %val, ptr addrspace(1) %ptr) {
 ; GFX7-NEXT:    s_mov_b32 s5, s6
 ; GFX7-NEXT:    v_alignbit_b32 v4, v24, v4, 16
 ; GFX7-NEXT:    s_waitcnt vmcnt(2)
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v14
-; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GFX7-NEXT:    v_alignbit_b32 v28, v7, v6, 16
+; GFX7-NEXT:    v_mul_f32_e32 v6, 1.0, v6
+; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
+; GFX7-NEXT:    v_alignbit_b32 v28, v6, v7, 16
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 16, v9
 ; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v8
 ; GFX7-NEXT:    v_alignbit_b32 v10, v6, v7, 16
@@ -1493,7 +1494,6 @@ define void @v_store_global_v32bf16(<32 x bfloat> %val, ptr addrspace(1) %ptr) {
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v18
 ; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GFX7-NEXT:    v_mul_f32_e32 v14, 1.0, v20
 ; GFX7-NEXT:    v_alignbit_b32 v7, v6, v7, 16
 ; GFX7-NEXT:    v_mul_f32_e32 v6, 1.0, v17
 ; GFX7-NEXT:    v_alignbit_b32 v8, v8, v14, 16
@@ -5378,15 +5378,14 @@ define { <32 x i32>, bfloat } @test_overflow_stack(bfloat %a, <32 x i32> %b) {
 ; GCN-LABEL: test_overflow_stack:
 ; GCN:       ; %bb.0:
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:8
 ; GCN-NEXT:    buffer_store_dword v2, v0, s[0:3], 0 offen
 ; GCN-NEXT:    s_waitcnt expcnt(0)
-; GCN-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:8
-; GCN-NEXT:    v_add_i32_e32 v31, vcc, 0x7c, v0
+; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x7c, v0
 ; GCN-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
 ; GCN-NEXT:    buffer_load_dword v33, off, s[0:3], s32
-; GCN-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NEXT:    buffer_store_dword v2, v31, s[0:3], 0 offen
-; GCN-NEXT:    s_waitcnt expcnt(0)
+; GCN-NEXT:    s_waitcnt vmcnt(3)
+; GCN-NEXT:    buffer_store_dword v31, v2, s[0:3], 0 offen
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x78, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    buffer_store_dword v32, v2, s[0:3], 0 offen
@@ -5394,6 +5393,7 @@ define { <32 x i32>, bfloat } @test_overflow_stack(bfloat %a, <32 x i32> %b) {
 ; GCN-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-NEXT:    buffer_store_dword v33, v2, s[0:3], 0 offen
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x70, v0
+; GCN-NEXT:    s_waitcnt expcnt(2)
 ; GCN-NEXT:    v_add_i32_e32 v31, vcc, 0x6c, v0
 ; GCN-NEXT:    buffer_store_dword v30, v2, s[0:3], 0 offen
 ; GCN-NEXT:    v_add_i32_e32 v2, vcc, 0x68, v0
@@ -5633,12 +5633,11 @@ define { <32 x i32>, bfloat } @test_overflow_stack(bfloat %a, <32 x i32> %b) {
 ; GFX9-NEXT:    buffer_store_dword v28, v0, s[0:3], 0 offen offset:104
 ; GFX9-NEXT:    buffer_store_dword v27, v0, s[0:3], 0 offen offset:100
 ; GFX9-NEXT:    buffer_store_dword v26, v0, s[0:3], 0 offen offset:96
-; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    s_nop 0
-; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v25, v0, s[0:3], 0 offen offset:92
-; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dword v24, v0, s[0:3], 0 offen offset:88
 ; GFX9-NEXT:    buffer_store_dword v23, v0, s[0:3], 0 offen offset:84
@@ -5664,10 +5663,11 @@ define { <32 x i32>, bfloat } @test_overflow_stack(bfloat %a, <32 x i32> %b) {
 ; GFX9-NEXT:    buffer_store_dword v3, v0, s[0:3], 0 offen offset:4
 ; GFX9-NEXT:    buffer_store_dword v2, v0, s[0:3], 0 offen
 ; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    buffer_store_dword v27, v0, s[0:3], 0 offen offset:124
+; GFX9-NEXT:    buffer_store_dword v25, v0, s[0:3], 0 offen offset:124
+; GFX9-NEXT:    s_waitcnt vmcnt(25)
 ; GFX9-NEXT:    buffer_store_dword v26, v0, s[0:3], 0 offen offset:120
 ; GFX9-NEXT:    s_waitcnt vmcnt(25)
-; GFX9-NEXT:    buffer_store_dword v25, v0, s[0:3], 0 offen offset:116
+; GFX9-NEXT:    buffer_store_dword v27, v0, s[0:3], 0 offen offset:116
 ; GFX9-NEXT:    buffer_store_short v1, v0, s[0:3], 0 offen offset:128
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
@@ -8479,30 +8479,30 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX9-LABEL: global_extload_v32bf16_to_v32f64:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    global_load_ushort v8, v[1:2], off offset:62
-; GFX9-NEXT:    global_load_ushort v10, v[1:2], off offset:60
-; GFX9-NEXT:    global_load_ushort v11, v[1:2], off offset:58
-; GFX9-NEXT:    global_load_ushort v12, v[1:2], off offset:56
-; GFX9-NEXT:    global_load_ushort v13, v[1:2], off offset:54
-; GFX9-NEXT:    global_load_ushort v14, v[1:2], off offset:52
-; GFX9-NEXT:    global_load_ushort v15, v[1:2], off offset:50
-; GFX9-NEXT:    global_load_ushort v16, v[1:2], off offset:48
-; GFX9-NEXT:    global_load_ushort v17, v[1:2], off offset:46
-; GFX9-NEXT:    global_load_ushort v18, v[1:2], off offset:44
-; GFX9-NEXT:    global_load_ushort v19, v[1:2], off offset:42
-; GFX9-NEXT:    global_load_ushort v20, v[1:2], off offset:40
-; GFX9-NEXT:    global_load_ushort v21, v[1:2], off offset:38
-; GFX9-NEXT:    global_load_ushort v22, v[1:2], off offset:36
-; GFX9-NEXT:    global_load_ushort v23, v[1:2], off offset:34
-; GFX9-NEXT:    global_load_ushort v24, v[1:2], off offset:32
-; GFX9-NEXT:    global_load_ushort v25, v[1:2], off
-; GFX9-NEXT:    global_load_ushort v26, v[1:2], off offset:2
-; GFX9-NEXT:    global_load_ushort v27, v[1:2], off offset:30
+; GFX9-NEXT:    global_load_ushort v9, v[1:2], off offset:62
+; GFX9-NEXT:    global_load_ushort v11, v[1:2], off offset:60
+; GFX9-NEXT:    global_load_ushort v12, v[1:2], off offset:58
+; GFX9-NEXT:    global_load_ushort v13, v[1:2], off offset:56
+; GFX9-NEXT:    global_load_ushort v14, v[1:2], off offset:54
+; GFX9-NEXT:    global_load_ushort v15, v[1:2], off offset:52
+; GFX9-NEXT:    global_load_ushort v16, v[1:2], off offset:50
+; GFX9-NEXT:    global_load_ushort v17, v[1:2], off offset:48
+; GFX9-NEXT:    global_load_ushort v18, v[1:2], off offset:46
+; GFX9-NEXT:    global_load_ushort v19, v[1:2], off offset:44
+; GFX9-NEXT:    global_load_ushort v20, v[1:2], off offset:42
+; GFX9-NEXT:    global_load_ushort v21, v[1:2], off offset:40
+; GFX9-NEXT:    global_load_ushort v22, v[1:2], off offset:38
+; GFX9-NEXT:    global_load_ushort v23, v[1:2], off offset:36
+; GFX9-NEXT:    global_load_ushort v24, v[1:2], off offset:34
+; GFX9-NEXT:    global_load_ushort v25, v[1:2], off offset:32
+; GFX9-NEXT:    global_load_ushort v26, v[1:2], off
+; GFX9-NEXT:    global_load_ushort v27, v[1:2], off offset:2
 ; GFX9-NEXT:    global_load_ushort v3, v[1:2], off offset:16
 ; GFX9-NEXT:    global_load_ushort v4, v[1:2], off offset:18
 ; GFX9-NEXT:    global_load_ushort v5, v[1:2], off offset:20
 ; GFX9-NEXT:    global_load_ushort v6, v[1:2], off offset:22
-; GFX9-NEXT:    global_load_ushort v28, v[1:2], off offset:24
+; GFX9-NEXT:    global_load_ushort v8, v[1:2], off offset:24
+; GFX9-NEXT:    global_load_ushort v28, v[1:2], off offset:30
 ; GFX9-NEXT:    global_load_ushort v29, v[1:2], off offset:26
 ; GFX9-NEXT:    global_load_ushort v30, v[1:2], off offset:28
 ; GFX9-NEXT:    global_load_ushort v31, v[1:2], off offset:4
@@ -8513,122 +8513,122 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    global_load_ushort v1, v[1:2], off offset:14
 ; GFX9-NEXT:    s_waitcnt vmcnt(31)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v8
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v2
+; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v9
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v10
-; GFX9-NEXT:    s_waitcnt vmcnt(28)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v10, 16, v12
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:252
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:248
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v2
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v11
-; GFX9-NEXT:    s_waitcnt vmcnt(29)
+; GFX9-NEXT:    s_waitcnt vmcnt(28)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v11, 16, v13
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:244
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:240
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v2
-; GFX9-NEXT:    s_waitcnt vmcnt(30)
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:252
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:248
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v2
+; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v12
+; GFX9-NEXT:    s_waitcnt vmcnt(29)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v12, 16, v14
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:236
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:232
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v10
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[10:11], v11
-; GFX9-NEXT:    s_waitcnt vmcnt(31)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v13, 16, v15
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:244
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:240
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v14, 16, v16
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:228
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:224
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v12
+; GFX9-NEXT:    v_lshlrev_b32_e32 v13, 16, v15
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:236
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:232
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v11
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[11:12], v12
 ; GFX9-NEXT:    s_waitcnt vmcnt(31)
+; GFX9-NEXT:    v_lshlrev_b32_e32 v14, 16, v16
+; GFX9-NEXT:    s_waitcnt vmcnt(30)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v15, 16, v17
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[12:13], v13
-; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:220
-; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:216
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[10:11], v14
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[14:15], v15
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:228
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:224
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v13
+; GFX9-NEXT:    s_waitcnt vmcnt(31)
+; GFX9-NEXT:    v_lshlrev_b32_e32 v16, 16, v18
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[13:14], v14
+; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:220
+; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:216
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[11:12], v15
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[15:16], v16
 ; GFX9-NEXT:    s_waitcnt vmcnt(32)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v18
+; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v19
 ; GFX9-NEXT:    s_waitcnt vmcnt(30)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v18, 16, v20
+; GFX9-NEXT:    v_lshlrev_b32_e32 v19, 16, v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(28)
+; GFX9-NEXT:    v_lshlrev_b32_e32 v21, 16, v23
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:212
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:208
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v2
+; GFX9-NEXT:    v_lshlrev_b32_e32 v17, 16, v20
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v20, 16, v22
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:212
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:208
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v2
-; GFX9-NEXT:    v_lshlrev_b32_e32 v16, 16, v19
-; GFX9-NEXT:    v_lshlrev_b32_e32 v19, 16, v21
-; GFX9-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:204
-; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:200
-; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:196
-; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:192
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[10:11], v20
+; GFX9-NEXT:    buffer_store_dword v14, v0, s[0:3], 0 offen offset:204
+; GFX9-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:200
+; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:196
+; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:192
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[11:12], v21
 ; GFX9-NEXT:    s_waitcnt vmcnt(33)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v23
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[16:17], v16
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[12:13], v18
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[18:19], v19
-; GFX9-NEXT:    buffer_store_dword v15, v0, s[0:3], 0 offen offset:188
-; GFX9-NEXT:    buffer_store_dword v14, v0, s[0:3], 0 offen offset:184
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:180
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:176
-; GFX9-NEXT:    buffer_store_dword v17, v0, s[0:3], 0 offen offset:172
-; GFX9-NEXT:    buffer_store_dword v16, v0, s[0:3], 0 offen offset:168
-; GFX9-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:164
-; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:160
-; GFX9-NEXT:    buffer_store_dword v19, v0, s[0:3], 0 offen offset:156
-; GFX9-NEXT:    buffer_store_dword v18, v0, s[0:3], 0 offen offset:152
-; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:148
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v2
-; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:144
+; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v24
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[17:18], v17
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[13:14], v19
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[19:20], v20
+; GFX9-NEXT:    buffer_store_dword v16, v0, s[0:3], 0 offen offset:188
+; GFX9-NEXT:    buffer_store_dword v15, v0, s[0:3], 0 offen offset:184
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:180
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:176
+; GFX9-NEXT:    buffer_store_dword v18, v0, s[0:3], 0 offen offset:172
+; GFX9-NEXT:    buffer_store_dword v17, v0, s[0:3], 0 offen offset:168
+; GFX9-NEXT:    buffer_store_dword v14, v0, s[0:3], 0 offen offset:164
+; GFX9-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:160
+; GFX9-NEXT:    buffer_store_dword v20, v0, s[0:3], 0 offen offset:156
+; GFX9-NEXT:    buffer_store_dword v19, v0, s[0:3], 0 offen offset:152
+; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:148
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v2
+; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:144
 ; GFX9-NEXT:    s_waitcnt vmcnt(44)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v10, 16, v24
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:140
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:136
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v10
-; GFX9-NEXT:    s_waitcnt vmcnt(43)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v12, 16, v27
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:132
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:128
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v12
+; GFX9-NEXT:    v_lshlrev_b32_e32 v11, 16, v25
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:140
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:136
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v11
 ; GFX9-NEXT:    s_waitcnt vmcnt(38)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v14, 16, v30
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:124
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:120
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v14
-; GFX9-NEXT:    v_lshlrev_b32_e32 v16, 16, v29
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:116
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:112
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v16
-; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v25
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[10:11], v2
+; GFX9-NEXT:    v_lshlrev_b32_e32 v13, 16, v28
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:132
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:128
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v13
+; GFX9-NEXT:    s_waitcnt vmcnt(38)
+; GFX9-NEXT:    v_lshlrev_b32_e32 v15, 16, v30
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:124
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:120
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v15
+; GFX9-NEXT:    v_lshlrev_b32_e32 v17, 16, v29
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:116
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:112
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[9:10], v17
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v26
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[12:13], v2
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[11:12], v2
+; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v27
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[13:14], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(41)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v31
-; GFX9-NEXT:    v_lshlrev_b32_e32 v18, 16, v28
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[14:15], v2
+; GFX9-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[15:16], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(40)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v32
-; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:108
-; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:104
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v18
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[16:17], v2
+; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:108
+; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:104
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v8
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[17:18], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(41)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v33
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[18:19], v2
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[19:20], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(40)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v34
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[20:21], v2
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[21:22], v2
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v2, 16, v5
 ; GFX9-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:100
 ; GFX9-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:96
 ; GFX9-NEXT:    v_cvt_f64_f32_e32 v[8:9], v6
 ; GFX9-NEXT:    v_cvt_f64_f32_e32 v[5:6], v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(41)
-; GFX9-NEXT:    v_lshlrev_b32_e32 v22, 16, v7
+; GFX9-NEXT:    v_lshlrev_b32_e32 v10, 16, v7
 ; GFX9-NEXT:    s_waitcnt vmcnt(40)
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v7, 16, v1
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v1, 16, v4
@@ -8642,25 +8642,25 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX9-NEXT:    v_cvt_f64_f32_e32 v[6:7], v3
 ; GFX9-NEXT:    buffer_store_dword v2, v0, s[0:3], 0 offen offset:76
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:72
-; GFX9-NEXT:    v_cvt_f64_f32_e32 v[1:2], v22
+; GFX9-NEXT:    v_cvt_f64_f32_e32 v[1:2], v10
 ; GFX9-NEXT:    buffer_store_dword v7, v0, s[0:3], 0 offen offset:68
 ; GFX9-NEXT:    buffer_store_dword v6, v0, s[0:3], 0 offen offset:64
 ; GFX9-NEXT:    buffer_store_dword v5, v0, s[0:3], 0 offen offset:60
 ; GFX9-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen offset:56
 ; GFX9-NEXT:    buffer_store_dword v2, v0, s[0:3], 0 offen offset:52
 ; GFX9-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:48
-; GFX9-NEXT:    buffer_store_dword v21, v0, s[0:3], 0 offen offset:44
-; GFX9-NEXT:    buffer_store_dword v20, v0, s[0:3], 0 offen offset:40
-; GFX9-NEXT:    buffer_store_dword v19, v0, s[0:3], 0 offen offset:36
-; GFX9-NEXT:    buffer_store_dword v18, v0, s[0:3], 0 offen offset:32
-; GFX9-NEXT:    buffer_store_dword v17, v0, s[0:3], 0 offen offset:28
-; GFX9-NEXT:    buffer_store_dword v16, v0, s[0:3], 0 offen offset:24
-; GFX9-NEXT:    buffer_store_dword v15, v0, s[0:3], 0 offen offset:20
-; GFX9-NEXT:    buffer_store_dword v14, v0, s[0:3], 0 offen offset:16
-; GFX9-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:12
-; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:8
-; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:4
-; GFX9-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen
+; GFX9-NEXT:    buffer_store_dword v22, v0, s[0:3], 0 offen offset:44
+; GFX9-NEXT:    buffer_store_dword v21, v0, s[0:3], 0 offen offset:40
+; GFX9-NEXT:    buffer_store_dword v20, v0, s[0:3], 0 offen offset:36
+; GFX9-NEXT:    buffer_store_dword v19, v0, s[0:3], 0 offen offset:32
+; GFX9-NEXT:    buffer_store_dword v18, v0, s[0:3], 0 offen offset:28
+; GFX9-NEXT:    buffer_store_dword v17, v0, s[0:3], 0 offen offset:24
+; GFX9-NEXT:    buffer_store_dword v16, v0, s[0:3], 0 offen offset:20
+; GFX9-NEXT:    buffer_store_dword v15, v0, s[0:3], 0 offen offset:16
+; GFX9-NEXT:    buffer_store_dword v14, v0, s[0:3], 0 offen offset:12
+; GFX9-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:8
+; GFX9-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:4
+; GFX9-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -8861,72 +8861,72 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    s_clause 0x1f
-; GFX11-NEXT:    global_load_u16 v3, v[1:2], off offset:12
-; GFX11-NEXT:    global_load_u16 v4, v[1:2], off offset:8
-; GFX11-NEXT:    global_load_u16 v5, v[1:2], off offset:4
-; GFX11-NEXT:    global_load_u16 v6, v[1:2], off offset:2
+; GFX11-NEXT:    global_load_u16 v3, v[1:2], off offset:2
+; GFX11-NEXT:    global_load_u16 v4, v[1:2], off offset:12
+; GFX11-NEXT:    global_load_u16 v5, v[1:2], off offset:8
+; GFX11-NEXT:    global_load_u16 v6, v[1:2], off offset:4
 ; GFX11-NEXT:    global_load_u16 v7, v[1:2], off
 ; GFX11-NEXT:    global_load_u16 v8, v[1:2], off offset:6
 ; GFX11-NEXT:    global_load_u16 v9, v[1:2], off offset:10
 ; GFX11-NEXT:    global_load_u16 v10, v[1:2], off offset:14
-; GFX11-NEXT:    global_load_u16 v11, v[1:2], off offset:28
-; GFX11-NEXT:    global_load_u16 v12, v[1:2], off offset:24
-; GFX11-NEXT:    global_load_u16 v13, v[1:2], off offset:20
-; GFX11-NEXT:    global_load_u16 v14, v[1:2], off offset:18
+; GFX11-NEXT:    global_load_u16 v11, v[1:2], off offset:18
+; GFX11-NEXT:    global_load_u16 v12, v[1:2], off offset:28
+; GFX11-NEXT:    global_load_u16 v13, v[1:2], off offset:24
+; GFX11-NEXT:    global_load_u16 v14, v[1:2], off offset:20
 ; GFX11-NEXT:    global_load_u16 v15, v[1:2], off offset:16
 ; GFX11-NEXT:    global_load_u16 v16, v[1:2], off offset:22
 ; GFX11-NEXT:    global_load_u16 v17, v[1:2], off offset:26
 ; GFX11-NEXT:    global_load_u16 v18, v[1:2], off offset:30
-; GFX11-NEXT:    global_load_u16 v19, v[1:2], off offset:44
-; GFX11-NEXT:    global_load_u16 v20, v[1:2], off offset:40
-; GFX11-NEXT:    global_load_u16 v21, v[1:2], off offset:36
-; GFX11-NEXT:    global_load_u16 v22, v[1:2], off offset:34
+; GFX11-NEXT:    global_load_u16 v19, v[1:2], off offset:34
+; GFX11-NEXT:    global_load_u16 v20, v[1:2], off offset:44
+; GFX11-NEXT:    global_load_u16 v21, v[1:2], off offset:40
+; GFX11-NEXT:    global_load_u16 v22, v[1:2], off offset:36
 ; GFX11-NEXT:    global_load_u16 v23, v[1:2], off offset:32
 ; GFX11-NEXT:    global_load_u16 v24, v[1:2], off offset:38
 ; GFX11-NEXT:    global_load_u16 v25, v[1:2], off offset:42
 ; GFX11-NEXT:    global_load_u16 v26, v[1:2], off offset:46
-; GFX11-NEXT:    global_load_u16 v27, v[1:2], off offset:60
-; GFX11-NEXT:    global_load_u16 v28, v[1:2], off offset:56
-; GFX11-NEXT:    global_load_u16 v29, v[1:2], off offset:52
-; GFX11-NEXT:    global_load_u16 v30, v[1:2], off offset:50
+; GFX11-NEXT:    global_load_u16 v27, v[1:2], off offset:50
+; GFX11-NEXT:    global_load_u16 v28, v[1:2], off offset:60
+; GFX11-NEXT:    global_load_u16 v29, v[1:2], off offset:56
+; GFX11-NEXT:    global_load_u16 v30, v[1:2], off offset:52
 ; GFX11-NEXT:    global_load_u16 v31, v[1:2], off offset:48
 ; GFX11-NEXT:    global_load_u16 v32, v[1:2], off offset:54
 ; GFX11-NEXT:    global_load_u16 v33, v[1:2], off offset:58
 ; GFX11-NEXT:    global_load_u16 v1, v[1:2], off offset:62
 ; GFX11-NEXT:    s_waitcnt vmcnt(31)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v39, 16, v3
+; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v3
 ; GFX11-NEXT:    s_waitcnt vmcnt(30)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v38, 16, v4
 ; GFX11-NEXT:    s_waitcnt vmcnt(29)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GFX11-NEXT:    s_waitcnt vmcnt(28)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v6
+; GFX11-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GFX11-NEXT:    s_waitcnt vmcnt(27)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v37, 16, v7
 ; GFX11-NEXT:    s_waitcnt vmcnt(26)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v6, 16, v8
+; GFX11-NEXT:    v_lshlrev_b32_e32 v7, 16, v8
 ; GFX11-NEXT:    s_waitcnt vmcnt(25)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
 ; GFX11-NEXT:    s_waitcnt vmcnt(24)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v10, 16, v10
 ; GFX11-NEXT:    s_waitcnt vmcnt(23)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v102, 16, v11
+; GFX11-NEXT:    v_lshlrev_b32_e32 v34, 16, v11
 ; GFX11-NEXT:    s_waitcnt vmcnt(22)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v101, 16, v12
+; GFX11-NEXT:    v_lshlrev_b32_e32 v100, 16, v12
 ; GFX11-NEXT:    s_waitcnt vmcnt(21)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v13, 16, v13
 ; GFX11-NEXT:    s_waitcnt vmcnt(20)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v14, 16, v14
 ; GFX11-NEXT:    s_waitcnt vmcnt(19)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v100, 16, v15
+; GFX11-NEXT:    v_lshlrev_b32_e32 v39, 16, v15
 ; GFX11-NEXT:    s_waitcnt vmcnt(18)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v34, 16, v16
+; GFX11-NEXT:    v_lshlrev_b32_e32 v35, 16, v16
 ; GFX11-NEXT:    s_waitcnt vmcnt(17)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v17, 16, v17
 ; GFX11-NEXT:    s_waitcnt vmcnt(16)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v18, 16, v18
 ; GFX11-NEXT:    s_waitcnt vmcnt(15)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v52, 16, v19
+; GFX11-NEXT:    v_lshlrev_b32_e32 v36, 16, v19
 ; GFX11-NEXT:    s_waitcnt vmcnt(14)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v49, 16, v20
 ; GFX11-NEXT:    s_waitcnt vmcnt(13)
@@ -8934,7 +8934,7 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX11-NEXT:    s_waitcnt vmcnt(12)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v22, 16, v22
 ; GFX11-NEXT:    s_waitcnt vmcnt(11)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v103, 16, v23
+; GFX11-NEXT:    v_lshlrev_b32_e32 v101, 16, v23
 ; GFX11-NEXT:    s_waitcnt vmcnt(10)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v48, 16, v24
 ; GFX11-NEXT:    s_waitcnt vmcnt(9)
@@ -8942,7 +8942,7 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX11-NEXT:    s_waitcnt vmcnt(8)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v26, 16, v26
 ; GFX11-NEXT:    s_waitcnt vmcnt(7)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v68, 16, v27
+; GFX11-NEXT:    v_lshlrev_b32_e32 v52, 16, v27
 ; GFX11-NEXT:    s_waitcnt vmcnt(6)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v65, 16, v28
 ; GFX11-NEXT:    s_waitcnt vmcnt(5)
@@ -8957,36 +8957,36 @@ define <32 x double> @global_extload_v32bf16_to_v32f64(ptr addrspace(1) %ptr) {
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v33, 16, v33
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[96:97], v68
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[84:85], v65
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[96:97], v65
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[84:85], v29
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[82:83], v64
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[86:87], v33
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[98:99], v1
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[80:81], v29
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[70:71], v30
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[80:81], v30
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[70:71], v52
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[68:69], v53
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[66:67], v26
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[64:65], v52
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[64:65], v49
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[54:55], v25
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[52:53], v49
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[52:53], v21
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[50:51], v48
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[48:49], v21
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[23:24], v34
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[35:36], v22
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[33:34], v103
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[48:49], v22
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[19:20], v34
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[23:24], v35
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[35:36], v36
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[33:34], v101
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[31:32], v18
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[29:30], v102
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[29:30], v100
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[27:28], v17
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[25:26], v101
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[21:22], v13
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[19:20], v14
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[17:18], v100
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[25:26], v13
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[21:22], v14
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[17:18], v39
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[15:16], v10
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[13:14], v39
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[13:14], v38
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[11:12], v9
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[9:10], v38
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[7:8], v6
-; GFX11-NEXT:    v_cvt_f64_f32_e32 v[5:6], v5
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[9:10], v5
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[7:8], v7
+; GFX11-NEXT:    v_cvt_f64_f32_e32 v[5:6], v6
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[3:4], v2
 ; GFX11-NEXT:    v_cvt_f64_f32_e32 v[1:2], v37
 ; GFX11-NEXT:    s_clause 0xf
@@ -11050,38 +11050,38 @@ define <16 x bfloat> @v_fadd_v16bf16(<16 x bfloat> %a, <16 x bfloat> %b) {
 ; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v22, 16, v0
 ; GFX11TRUE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v3.l, v3.h
-; GFX11TRUE16-NEXT:    v_dual_cndmask_b32 v10, v19, v21 :: v_dual_lshlrev_b32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v10, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v19, 0x400000, v2
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11TRUE16-NEXT:    v_dual_add_f32 v9, v22, v21 :: v_dual_and_b32 v8, 0xffff0000, v8
-; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
+; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v2, v16, v19, vcc_lo
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v16, v1, 16, 1
+; GFX11TRUE16-NEXT:    v_add_f32_e32 v9, v22, v21
+; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11TRUE16-NEXT:    v_add_f32_e32 v0, v0, v8
+; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v8, v9, 16, 1
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v2.l, v2.h
-; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
-; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v1, v16, v22, vcc_lo
 ; GFX11TRUE16-NEXT:    v_add_f32_e32 v17, v24, v23
+; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v24, 0x400000, v9
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v0
-; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v17, 16, 1
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v21, 0x400000, v17
+; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v8, v8, v24, vcc_lo
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_add3_u32 v19, v23, v17, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v0, 16, 1
+; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v8.l, v8.h
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v9, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_add3_u32 v16, v23, v0, 0x7fff
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
@@ -12394,280 +12394,280 @@ define <32 x bfloat> @v_fadd_v32bf16(<32 x bfloat> %a, <32 x bfloat> %b) {
 ; GFX10-LABEL: v_fadd_v32bf16:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v12
-; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
-; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX10-NEXT:    v_lshlrev_b32_e32 v39, 16, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v11
-; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v14
+; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v32, 16, v14
 ; GFX10-NEXT:    v_and_b32_e32 v30, 0xffff0000, v30
 ; GFX10-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v13
-; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v13
 ; GFX10-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GFX10-NEXT:    v_add_f32_e32 v12, v12, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v28, 16, v22
-; GFX10-NEXT:    v_add_f32_e32 v39, v48, v39
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v6
-; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX10-NEXT:    v_add_f32_e32 v11, v11, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v27, 16, v21
-; GFX10-NEXT:    v_add_f32_e32 v49, v50, v49
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v5
-; GFX10-NEXT:    v_add_f32_e32 v33, v34, v33
-; GFX10-NEXT:    v_add_f32_e32 v14, v14, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v30, 16, v24
-; GFX10-NEXT:    v_add_f32_e32 v35, v36, v35
+; GFX10-NEXT:    v_add_f32_e32 v31, v32, v31
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v12
+; GFX10-NEXT:    v_add_f32_e32 v30, v14, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v14, 16, v29
+; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_bfe_u32 v32, v31, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v31
+; GFX10-NEXT:    v_bfe_u32 v35, v30, 16, 1
+; GFX10-NEXT:    v_add_f32_e32 v33, v33, v14
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v31, v31
+; GFX10-NEXT:    v_add3_u32 v32, v32, v31, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX10-NEXT:    v_add3_u32 v31, v35, v30, 0x7fff
+; GFX10-NEXT:    v_add_f32_e32 v35, v13, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v13, 16, v28
+; GFX10-NEXT:    v_cndmask_b32_e32 v14, v32, v34, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v30
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
+; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v21
+; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v5
+; GFX10-NEXT:    v_add3_u32 v30, v34, v33, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v29, v31, v32, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_add_f32_e32 v34, v36, v13
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_add_f32_e32 v33, v12, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v12, 16, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v11
+; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
+; GFX10-NEXT:    v_cndmask_b32_e32 v13, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_add_f32_e32 v35, v36, v12
+; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v10
+; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX10-NEXT:    v_cndmask_b32_e32 v28, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_add_f32_e32 v34, v11, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v11, 16, v26
+; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
+; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
+; GFX10-NEXT:    v_cndmask_b32_e32 v12, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_add_f32_e32 v33, v36, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v9
+; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v3
+; GFX10-NEXT:    v_cndmask_b32_e32 v27, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_add_f32_e32 v35, v10, v26
+; GFX10-NEXT:    v_lshlrev_b32_e32 v10, 16, v25
+; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
+; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v2
+; GFX10-NEXT:    v_cndmask_b32_e32 v11, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_add_f32_e32 v34, v36, v10
+; GFX10-NEXT:    v_add_f32_e32 v9, v9, v25
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v8
-; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
 ; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX10-NEXT:    v_add_f32_e32 v13, v13, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v29, 16, v23
-; GFX10-NEXT:    v_add_f32_e32 v37, v38, v37
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v26, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v24
+; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_lshlrev_b32_e32 v52, 16, v1
+; GFX10-NEXT:    v_cndmask_b32_e32 v10, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_add_f32_e32 v33, v36, v33
+; GFX10-NEXT:    v_add_f32_e32 v8, v8, v24
+; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v23
+; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v25, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v9, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
 ; GFX10-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GFX10-NEXT:    v_add_f32_e32 v6, v6, v22
-; GFX10-NEXT:    v_lshlrev_b32_e32 v22, 16, v16
-; GFX10-NEXT:    v_add_f32_e32 v27, v50, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v0
-; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
-; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
-; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v9
-; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GFX10-NEXT:    v_add_f32_e32 v8, v8, v24
-; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v18
-; GFX10-NEXT:    v_add_f32_e32 v29, v38, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v2
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_add_f32_e32 v24, v35, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v30, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v32, v9, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v9
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX10-NEXT:    v_add_f32_e32 v7, v7, v23
-; GFX10-NEXT:    v_lshlrev_b32_e32 v23, 16, v17
-; GFX10-NEXT:    v_add_f32_e32 v28, v48, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v1
-; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_bfe_u32 v23, v24, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v36, 0x400000, v24
+; GFX10-NEXT:    v_cmp_u_f32_e64 s4, v24, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v9, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v34, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v34, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v8
+; GFX10-NEXT:    v_bfe_u32 v35, v7, 16, 1
+; GFX10-NEXT:    v_add3_u32 v23, v23, v24, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e64 s5, v7, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v31, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v32, v34, v8, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v22
+; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
+; GFX10-NEXT:    v_add3_u32 v24, v35, v7, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_add_f32_e32 v8, v34, v8
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
 ; GFX10-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX10-NEXT:    v_add_f32_e32 v0, v0, v16
-; GFX10-NEXT:    v_bfe_u32 v16, v33, 16, 1
-; GFX10-NEXT:    v_add_f32_e32 v10, v10, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v26, 16, v20
-; GFX10-NEXT:    v_add_f32_e32 v34, v34, v51
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v4
+; GFX10-NEXT:    v_add_f32_e32 v6, v6, v22
+; GFX10-NEXT:    v_cndmask_b32_e32 v32, v32, v33, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s6, v8, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s7, v6, v6
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v15
+; GFX10-NEXT:    v_add3_u32 v7, v35, v8, 0x7fff
+; GFX10-NEXT:    v_add_f32_e32 v35, v38, v37
+; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v21
+; GFX10-NEXT:    v_bfe_u32 v37, v6, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v6
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, v22, s6
+; GFX10-NEXT:    v_bfe_u32 v21, v35, 16, 1
+; GFX10-NEXT:    v_add_f32_e32 v5, v5, v8
+; GFX10-NEXT:    v_add3_u32 v37, v37, v6, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v20
 ; GFX10-NEXT:    v_and_b32_e32 v20, 0xffff0000, v20
+; GFX10-NEXT:    v_add3_u32 v6, v21, v35, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v21, 16, v4
+; GFX10-NEXT:    v_bfe_u32 v48, v5, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX10-NEXT:    v_add_f32_e32 v9, v9, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v25, 16, v19
-; GFX10-NEXT:    v_add_f32_e32 v30, v36, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v3
+; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v35
+; GFX10-NEXT:    v_cmp_u_f32_e64 s8, v35, v35
+; GFX10-NEXT:    v_add_f32_e32 v8, v21, v8
+; GFX10-NEXT:    v_add3_u32 v21, v48, v5, 0x7fff
+; GFX10-NEXT:    v_add_f32_e32 v4, v4, v20
+; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v19
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v5
+; GFX10-NEXT:    v_bfe_u32 v20, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s9, v5, v5
+; GFX10-NEXT:    v_bfe_u32 v5, v4, 16, 1
+; GFX10-NEXT:    v_add_f32_e32 v48, v49, v48
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v18
+; GFX10-NEXT:    v_add3_u32 v20, v20, v8, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s10, v8, v8
+; GFX10-NEXT:    v_add3_u32 v5, v5, v4, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v4
+; GFX10-NEXT:    v_cmp_u_f32_e64 s11, v4, v4
+; GFX10-NEXT:    v_bfe_u32 v4, v48, 16, 1
+; GFX10-NEXT:    v_add_f32_e32 v49, v51, v49
+; GFX10-NEXT:    v_or_b32_e32 v51, 0x400000, v48
+; GFX10-NEXT:    v_cmp_u_f32_e64 s12, v48, v48
 ; GFX10-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_add3_u32 v4, v4, v48, 0x7fff
+; GFX10-NEXT:    v_bfe_u32 v48, v49, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s13, v49, v49
+; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
+; GFX10-NEXT:    v_add_f32_e32 v3, v3, v19
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, v39, s8
+; GFX10-NEXT:    v_add3_u32 v19, v48, v49, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v49
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v17
 ; GFX10-NEXT:    v_add_f32_e32 v2, v2, v18
-; GFX10-NEXT:    v_add_f32_e32 v18, v48, v23
+; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v21, v35, s9
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v20, v50, s10
+; GFX10-NEXT:    v_add_f32_e32 v49, v52, v49
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v5, v8, s11
 ; GFX10-NEXT:    v_add_f32_e32 v1, v1, v17
-; GFX10-NEXT:    v_add_f32_e32 v17, v50, v22
-; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v33
-; GFX10-NEXT:    v_bfe_u32 v23, v14, 16, 1
-; GFX10-NEXT:    v_add3_u32 v16, v16, v33, 0x7fff
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX10-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
-; GFX10-NEXT:    v_add_f32_e32 v4, v4, v20
-; GFX10-NEXT:    v_add_f32_e32 v20, v36, v25
-; GFX10-NEXT:    v_add_f32_e32 v3, v3, v19
-; GFX10-NEXT:    v_add_f32_e32 v19, v38, v24
-; GFX10-NEXT:    v_or_b32_e32 v24, 0x400000, v14
-; GFX10-NEXT:    v_bfe_u32 v25, v35, 16, 1
-; GFX10-NEXT:    v_add3_u32 v23, v23, v14, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v22, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX10-NEXT:    v_add_f32_e32 v5, v5, v21
-; GFX10-NEXT:    v_add_f32_e32 v21, v51, v26
-; GFX10-NEXT:    v_or_b32_e32 v26, 0x400000, v35
-; GFX10-NEXT:    v_bfe_u32 v36, v13, 16, 1
-; GFX10-NEXT:    v_add3_u32 v25, v25, v35, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v23, v23, v24, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v13
-; GFX10-NEXT:    v_bfe_u32 v48, v37, 16, 1
-; GFX10-NEXT:    v_add3_u32 v36, v36, v13, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v37
-; GFX10-NEXT:    v_cndmask_b32_e32 v25, v25, v26, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
-; GFX10-NEXT:    v_bfe_u32 v51, v12, 16, 1
-; GFX10-NEXT:    v_add3_u32 v48, v48, v37, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v12
-; GFX10-NEXT:    v_bfe_u32 v22, v39, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v36, v36, v38, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX10-NEXT:    v_add3_u32 v51, v51, v12, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v14, 0x400000, v39
-; GFX10-NEXT:    v_bfe_u32 v24, v11, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v39, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v48, v48, v50, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v11
-; GFX10-NEXT:    v_bfe_u32 v26, v49, 16, 1
-; GFX10-NEXT:    v_add3_u32 v24, v24, v11, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v13, 0x400000, v49
-; GFX10-NEXT:    v_cndmask_b32_e32 v33, v51, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v39, v39
-; GFX10-NEXT:    v_bfe_u32 v38, v10, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v49, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v37, 0x400000, v10
-; GFX10-NEXT:    v_bfe_u32 v50, v34, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, v22, v14, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v11, v11
-; GFX10-NEXT:    v_add3_u32 v38, v38, v10, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v34
-; GFX10-NEXT:    v_bfe_u32 v51, v9, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v34, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v24, v24, v35, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v49, v49
-; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v9
-; GFX10-NEXT:    v_bfe_u32 v22, v30, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v9, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v11, 0x400000, v30
-; GFX10-NEXT:    v_cndmask_b32_e32 v13, v26, v13, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v10, v10
-; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v30, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v49, 0x400000, v8
-; GFX10-NEXT:    v_bfe_u32 v26, v29, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v37, v38, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX10-NEXT:    v_add3_u32 v35, v35, v8, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v10, 0x400000, v29
-; GFX10-NEXT:    v_bfe_u32 v38, v7, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v29, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v12, v50, v12, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
-; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
-; GFX10-NEXT:    v_bfe_u32 v50, v28, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v7, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v9, 0x400000, v28
-; GFX10-NEXT:    v_cndmask_b32_e32 v39, v51, v39, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
-; GFX10-NEXT:    v_bfe_u32 v51, v6, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v28, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v30, 0x400000, v6
-; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, v22, v11, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
-; GFX10-NEXT:    v_bfe_u32 v22, v27, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v6, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v27
-; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v35, v35, v49, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v29, v29
-; GFX10-NEXT:    v_bfe_u32 v49, v5, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v27, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v29, 0x400000, v5
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v26, v10, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX10-NEXT:    v_bfe_u32 v26, v21, 16, 1
-; GFX10-NEXT:    v_add3_u32 v49, v49, v5, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v7, 0x400000, v21
-; GFX10-NEXT:    v_cndmask_b32_e32 v34, v38, v34, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v28, v28
-; GFX10-NEXT:    v_bfe_u32 v38, v4, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v21, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v28, 0x400000, v4
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v50, v9, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX10-NEXT:    v_bfe_u32 v50, v20, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v4, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v6, 0x400000, v20
-; GFX10-NEXT:    v_cndmask_b32_e32 v30, v51, v30, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v27, v27
-; GFX10-NEXT:    v_add3_u32 v50, v50, v20, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v51, v3, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v27, 0x400000, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v22, v8, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX10-NEXT:    v_bfe_u32 v22, v19, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v5, 0x400000, v19
-; GFX10-NEXT:    v_add3_u32 v51, v51, v3, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v29, v49, v29, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX10-NEXT:    v_add3_u32 v22, v22, v19, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v49, v2, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v21, 0x400000, v2
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, v26, v7, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX10-NEXT:    v_bfe_u32 v26, v18, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v4, 0x400000, v18
-; GFX10-NEXT:    v_add3_u32 v49, v49, v2, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v28, v38, v28, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX10-NEXT:    v_bfe_u32 v38, v1, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v18, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v50, v6, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
-; GFX10-NEXT:    v_bfe_u32 v50, v17, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v1, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v22, v5, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX10-NEXT:    v_bfe_u32 v22, v0, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v17, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v18, 0x400000, v0
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v26, v4, vcc_lo
+; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_bfe_u32 v18, v49, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v52, 0x400000, v49
+; GFX10-NEXT:    v_cmp_u_f32_e64 s14, v49, v49
+; GFX10-NEXT:    v_bfe_u32 v39, v1, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v1
+; GFX10-NEXT:    v_add3_u32 v18, v18, v49, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
+; GFX10-NEXT:    v_add3_u32 v39, v39, v1, 0x7fff
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v0, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v38, v20, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v19, v48, s13
+; GFX10-NEXT:    v_add_f32_e32 v17, v49, v17
+; GFX10-NEXT:    v_add_f32_e32 v0, v0, v16
+; GFX10-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v39, v35, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v22, v2, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v49, v17, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v17
+; GFX10-NEXT:    v_bfe_u32 v50, v0, 16, 1
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_perm_b32 v1, v1, v4, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v50, v19, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v0
+; GFX10-NEXT:    v_add3_u32 v49, v49, v17, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
+; GFX10-NEXT:    v_add3_u32 v50, v50, v0, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v23, v36, s4
+; GFX10-NEXT:    v_bfe_u32 v36, v3, 16, 1
+; GFX10-NEXT:    v_cndmask_b32_e32 v8, v49, v8, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
-; GFX10-NEXT:    v_perm_b32 v4, v28, v7, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v7, v34, v10, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v0, v22, v18, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v37, v37, v38, s7
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v2
+; GFX10-NEXT:    v_add3_u32 v22, v22, v2, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v24, v34, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v50, v48, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX10-NEXT:    v_perm_b32 v0, v0, v17, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v49, v21, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v3
+; GFX10-NEXT:    v_add3_u32 v36, v36, v3, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v18, v52, s14
+; GFX10-NEXT:    v_perm_b32 v0, v0, v8, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v22, v38, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX10-NEXT:    v_perm_b32 v2, v2, v5, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v51, v27, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v5, v29, v8, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v8, v35, v11, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v3, v3, v6, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v6, v30, v9, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v9, v39, v12, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v51, s12
+; GFX10-NEXT:    v_perm_b32 v1, v1, v18, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v9, v9, v30, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v2, v2, v19, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v36, v34, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v10, v25, v10, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v11, v26, v11, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v12, v27, v12, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v13, v28, v13, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v3, v3, v4, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v4, v5, v20, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v5, v21, v6, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v6, v37, v7, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v7, v24, v23, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v14, v29, v14, 0x7060302
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v32
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v32
-; GFX10-NEXT:    v_add_f32_e32 v17, v31, v17
-; GFX10-NEXT:    v_add_f32_e32 v15, v15, v18
-; GFX10-NEXT:    v_bfe_u32 v10, v17, 16, 1
-; GFX10-NEXT:    v_bfe_u32 v11, v15, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v17
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_add_f32_e32 v17, v33, v8
+; GFX10-NEXT:    v_add_f32_e32 v15, v15, v16
+; GFX10-NEXT:    v_perm_b32 v8, v32, v31, 0x7060302
+; GFX10-NEXT:    v_bfe_u32 v16, v17, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v18, v15, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v15
-; GFX10-NEXT:    v_add3_u32 v18, v10, v17, 0x7fff
-; GFX10-NEXT:    v_add3_u32 v11, v11, v15, 0x7fff
-; GFX10-NEXT:    v_perm_b32 v10, v37, v13, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v13, v36, v25, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v18, v12, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v15
+; GFX10-NEXT:    v_add3_u32 v16, v16, v17, 0x7fff
+; GFX10-NEXT:    v_add3_u32 v18, v18, v15, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v19, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX10-NEXT:    v_perm_b32 v12, v33, v48, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v15, v11, v19, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v11, v24, v14, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v14, v23, v16, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v15, v15, v17, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v15, v18, v20, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v15, v15, v16, 0x7060302
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11TRUE16-LABEL: v_fadd_v32bf16:
@@ -16243,38 +16243,38 @@ define <16 x bfloat> @v_fmul_v16bf16(<16 x bfloat> %a, <16 x bfloat> %b) {
 ; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v22, 16, v0
 ; GFX11TRUE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v3.l, v3.h
-; GFX11TRUE16-NEXT:    v_dual_cndmask_b32 v10, v19, v21 :: v_dual_lshlrev_b32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v10, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v19, 0x400000, v2
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11TRUE16-NEXT:    v_dual_mul_f32 v9, v22, v21 :: v_dual_and_b32 v8, 0xffff0000, v8
-; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
+; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v2, v16, v19, vcc_lo
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v16, v1, 16, 1
+; GFX11TRUE16-NEXT:    v_mul_f32_e32 v9, v22, v21
+; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11TRUE16-NEXT:    v_mul_f32_e32 v0, v0, v8
+; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v8, v9, 16, 1
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v2.l, v2.h
-; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
-; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v1, v16, v22, vcc_lo
 ; GFX11TRUE16-NEXT:    v_mul_f32_e32 v17, v24, v23
+; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v24, 0x400000, v9
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v0
-; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v17, 16, 1
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v21, 0x400000, v17
+; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v8, v8, v24, vcc_lo
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_add3_u32 v19, v23, v17, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v0, 16, 1
+; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v8.l, v8.h
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v9, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_add3_u32 v16, v23, v0, 0x7fff
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
@@ -17587,280 +17587,280 @@ define <32 x bfloat> @v_fmul_v32bf16(<32 x bfloat> %a, <32 x bfloat> %b) {
 ; GFX10-LABEL: v_fmul_v32bf16:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v12
-; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
-; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX10-NEXT:    v_lshlrev_b32_e32 v39, 16, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v11
-; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v14
+; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v32, 16, v14
 ; GFX10-NEXT:    v_and_b32_e32 v30, 0xffff0000, v30
 ; GFX10-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v13
-; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v13
 ; GFX10-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GFX10-NEXT:    v_mul_f32_e32 v12, v12, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v28, 16, v22
-; GFX10-NEXT:    v_mul_f32_e32 v39, v48, v39
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v6
-; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX10-NEXT:    v_mul_f32_e32 v11, v11, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v27, 16, v21
-; GFX10-NEXT:    v_mul_f32_e32 v49, v50, v49
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v5
-; GFX10-NEXT:    v_mul_f32_e32 v33, v34, v33
-; GFX10-NEXT:    v_mul_f32_e32 v14, v14, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v30, 16, v24
-; GFX10-NEXT:    v_mul_f32_e32 v35, v36, v35
+; GFX10-NEXT:    v_mul_f32_e32 v31, v32, v31
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v12
+; GFX10-NEXT:    v_mul_f32_e32 v30, v14, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v14, 16, v29
+; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_bfe_u32 v32, v31, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v31
+; GFX10-NEXT:    v_bfe_u32 v35, v30, 16, 1
+; GFX10-NEXT:    v_mul_f32_e32 v33, v33, v14
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v31, v31
+; GFX10-NEXT:    v_add3_u32 v32, v32, v31, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX10-NEXT:    v_add3_u32 v31, v35, v30, 0x7fff
+; GFX10-NEXT:    v_mul_f32_e32 v35, v13, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v13, 16, v28
+; GFX10-NEXT:    v_cndmask_b32_e32 v14, v32, v34, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v30
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
+; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v21
+; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v5
+; GFX10-NEXT:    v_add3_u32 v30, v34, v33, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v29, v31, v32, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_mul_f32_e32 v34, v36, v13
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_mul_f32_e32 v33, v12, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v12, 16, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v11
+; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
+; GFX10-NEXT:    v_cndmask_b32_e32 v13, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_mul_f32_e32 v35, v36, v12
+; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v10
+; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX10-NEXT:    v_cndmask_b32_e32 v28, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_mul_f32_e32 v34, v11, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v11, 16, v26
+; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
+; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
+; GFX10-NEXT:    v_cndmask_b32_e32 v12, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_mul_f32_e32 v33, v36, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v9
+; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v3
+; GFX10-NEXT:    v_cndmask_b32_e32 v27, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_mul_f32_e32 v35, v10, v26
+; GFX10-NEXT:    v_lshlrev_b32_e32 v10, 16, v25
+; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
+; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v2
+; GFX10-NEXT:    v_cndmask_b32_e32 v11, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_mul_f32_e32 v34, v36, v10
+; GFX10-NEXT:    v_mul_f32_e32 v9, v9, v25
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v8
-; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
 ; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX10-NEXT:    v_mul_f32_e32 v13, v13, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v29, 16, v23
-; GFX10-NEXT:    v_mul_f32_e32 v37, v38, v37
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v26, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v24
+; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_lshlrev_b32_e32 v52, 16, v1
+; GFX10-NEXT:    v_cndmask_b32_e32 v10, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_mul_f32_e32 v33, v36, v33
+; GFX10-NEXT:    v_mul_f32_e32 v8, v8, v24
+; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v23
+; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v25, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v9, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
 ; GFX10-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GFX10-NEXT:    v_mul_f32_e32 v6, v6, v22
-; GFX10-NEXT:    v_lshlrev_b32_e32 v22, 16, v16
-; GFX10-NEXT:    v_mul_f32_e32 v27, v50, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v0
-; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
-; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
-; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v9
-; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GFX10-NEXT:    v_mul_f32_e32 v8, v8, v24
-; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v18
-; GFX10-NEXT:    v_mul_f32_e32 v29, v38, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v2
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_mul_f32_e32 v24, v35, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v30, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v32, v9, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v9
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX10-NEXT:    v_mul_f32_e32 v7, v7, v23
-; GFX10-NEXT:    v_lshlrev_b32_e32 v23, 16, v17
-; GFX10-NEXT:    v_mul_f32_e32 v28, v48, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v1
-; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_bfe_u32 v23, v24, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v36, 0x400000, v24
+; GFX10-NEXT:    v_cmp_u_f32_e64 s4, v24, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v9, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v34, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v34, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v8
+; GFX10-NEXT:    v_bfe_u32 v35, v7, 16, 1
+; GFX10-NEXT:    v_add3_u32 v23, v23, v24, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e64 s5, v7, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v31, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v32, v34, v8, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v22
+; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
+; GFX10-NEXT:    v_add3_u32 v24, v35, v7, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_mul_f32_e32 v8, v34, v8
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
 ; GFX10-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX10-NEXT:    v_mul_f32_e32 v0, v0, v16
-; GFX10-NEXT:    v_bfe_u32 v16, v33, 16, 1
-; GFX10-NEXT:    v_mul_f32_e32 v10, v10, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v26, 16, v20
-; GFX10-NEXT:    v_mul_f32_e32 v34, v34, v51
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v4
+; GFX10-NEXT:    v_mul_f32_e32 v6, v6, v22
+; GFX10-NEXT:    v_cndmask_b32_e32 v32, v32, v33, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s6, v8, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s7, v6, v6
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v15
+; GFX10-NEXT:    v_add3_u32 v7, v35, v8, 0x7fff
+; GFX10-NEXT:    v_mul_f32_e32 v35, v38, v37
+; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v21
+; GFX10-NEXT:    v_bfe_u32 v37, v6, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v6
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, v22, s6
+; GFX10-NEXT:    v_bfe_u32 v21, v35, 16, 1
+; GFX10-NEXT:    v_mul_f32_e32 v5, v5, v8
+; GFX10-NEXT:    v_add3_u32 v37, v37, v6, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v20
 ; GFX10-NEXT:    v_and_b32_e32 v20, 0xffff0000, v20
+; GFX10-NEXT:    v_add3_u32 v6, v21, v35, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v21, 16, v4
+; GFX10-NEXT:    v_bfe_u32 v48, v5, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX10-NEXT:    v_mul_f32_e32 v9, v9, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v25, 16, v19
-; GFX10-NEXT:    v_mul_f32_e32 v30, v36, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v3
+; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v35
+; GFX10-NEXT:    v_cmp_u_f32_e64 s8, v35, v35
+; GFX10-NEXT:    v_mul_f32_e32 v8, v21, v8
+; GFX10-NEXT:    v_add3_u32 v21, v48, v5, 0x7fff
+; GFX10-NEXT:    v_mul_f32_e32 v4, v4, v20
+; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v19
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v5
+; GFX10-NEXT:    v_bfe_u32 v20, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s9, v5, v5
+; GFX10-NEXT:    v_bfe_u32 v5, v4, 16, 1
+; GFX10-NEXT:    v_mul_f32_e32 v48, v49, v48
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v18
+; GFX10-NEXT:    v_add3_u32 v20, v20, v8, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s10, v8, v8
+; GFX10-NEXT:    v_add3_u32 v5, v5, v4, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v4
+; GFX10-NEXT:    v_cmp_u_f32_e64 s11, v4, v4
+; GFX10-NEXT:    v_bfe_u32 v4, v48, 16, 1
+; GFX10-NEXT:    v_mul_f32_e32 v49, v51, v49
+; GFX10-NEXT:    v_or_b32_e32 v51, 0x400000, v48
+; GFX10-NEXT:    v_cmp_u_f32_e64 s12, v48, v48
 ; GFX10-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_add3_u32 v4, v4, v48, 0x7fff
+; GFX10-NEXT:    v_bfe_u32 v48, v49, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s13, v49, v49
+; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
+; GFX10-NEXT:    v_mul_f32_e32 v3, v3, v19
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, v39, s8
+; GFX10-NEXT:    v_add3_u32 v19, v48, v49, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v49
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v17
 ; GFX10-NEXT:    v_mul_f32_e32 v2, v2, v18
-; GFX10-NEXT:    v_mul_f32_e32 v18, v48, v23
+; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v21, v35, s9
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v20, v50, s10
+; GFX10-NEXT:    v_mul_f32_e32 v49, v52, v49
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v5, v8, s11
 ; GFX10-NEXT:    v_mul_f32_e32 v1, v1, v17
-; GFX10-NEXT:    v_mul_f32_e32 v17, v50, v22
-; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v33
-; GFX10-NEXT:    v_bfe_u32 v23, v14, 16, 1
-; GFX10-NEXT:    v_add3_u32 v16, v16, v33, 0x7fff
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX10-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
-; GFX10-NEXT:    v_mul_f32_e32 v4, v4, v20
-; GFX10-NEXT:    v_mul_f32_e32 v20, v36, v25
-; GFX10-NEXT:    v_mul_f32_e32 v3, v3, v19
-; GFX10-NEXT:    v_mul_f32_e32 v19, v38, v24
-; GFX10-NEXT:    v_or_b32_e32 v24, 0x400000, v14
-; GFX10-NEXT:    v_bfe_u32 v25, v35, 16, 1
-; GFX10-NEXT:    v_add3_u32 v23, v23, v14, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v22, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX10-NEXT:    v_mul_f32_e32 v5, v5, v21
-; GFX10-NEXT:    v_mul_f32_e32 v21, v51, v26
-; GFX10-NEXT:    v_or_b32_e32 v26, 0x400000, v35
-; GFX10-NEXT:    v_bfe_u32 v36, v13, 16, 1
-; GFX10-NEXT:    v_add3_u32 v25, v25, v35, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v23, v23, v24, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v13
-; GFX10-NEXT:    v_bfe_u32 v48, v37, 16, 1
-; GFX10-NEXT:    v_add3_u32 v36, v36, v13, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v37
-; GFX10-NEXT:    v_cndmask_b32_e32 v25, v25, v26, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
-; GFX10-NEXT:    v_bfe_u32 v51, v12, 16, 1
-; GFX10-NEXT:    v_add3_u32 v48, v48, v37, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v12
-; GFX10-NEXT:    v_bfe_u32 v22, v39, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v36, v36, v38, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX10-NEXT:    v_add3_u32 v51, v51, v12, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v14, 0x400000, v39
-; GFX10-NEXT:    v_bfe_u32 v24, v11, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v39, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v48, v48, v50, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v11
-; GFX10-NEXT:    v_bfe_u32 v26, v49, 16, 1
-; GFX10-NEXT:    v_add3_u32 v24, v24, v11, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v13, 0x400000, v49
-; GFX10-NEXT:    v_cndmask_b32_e32 v33, v51, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v39, v39
-; GFX10-NEXT:    v_bfe_u32 v38, v10, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v49, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v37, 0x400000, v10
-; GFX10-NEXT:    v_bfe_u32 v50, v34, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, v22, v14, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v11, v11
-; GFX10-NEXT:    v_add3_u32 v38, v38, v10, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v34
-; GFX10-NEXT:    v_bfe_u32 v51, v9, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v34, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v24, v24, v35, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v49, v49
-; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v9
-; GFX10-NEXT:    v_bfe_u32 v22, v30, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v9, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v11, 0x400000, v30
-; GFX10-NEXT:    v_cndmask_b32_e32 v13, v26, v13, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v10, v10
-; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v30, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v49, 0x400000, v8
-; GFX10-NEXT:    v_bfe_u32 v26, v29, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v37, v38, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX10-NEXT:    v_add3_u32 v35, v35, v8, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v10, 0x400000, v29
-; GFX10-NEXT:    v_bfe_u32 v38, v7, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v29, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v12, v50, v12, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
-; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
-; GFX10-NEXT:    v_bfe_u32 v50, v28, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v7, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v9, 0x400000, v28
-; GFX10-NEXT:    v_cndmask_b32_e32 v39, v51, v39, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
-; GFX10-NEXT:    v_bfe_u32 v51, v6, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v28, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v30, 0x400000, v6
-; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, v22, v11, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
-; GFX10-NEXT:    v_bfe_u32 v22, v27, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v6, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v27
-; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v35, v35, v49, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v29, v29
-; GFX10-NEXT:    v_bfe_u32 v49, v5, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v27, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v29, 0x400000, v5
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v26, v10, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX10-NEXT:    v_bfe_u32 v26, v21, 16, 1
-; GFX10-NEXT:    v_add3_u32 v49, v49, v5, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v7, 0x400000, v21
-; GFX10-NEXT:    v_cndmask_b32_e32 v34, v38, v34, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v28, v28
-; GFX10-NEXT:    v_bfe_u32 v38, v4, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v21, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v28, 0x400000, v4
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v50, v9, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX10-NEXT:    v_bfe_u32 v50, v20, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v4, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v6, 0x400000, v20
-; GFX10-NEXT:    v_cndmask_b32_e32 v30, v51, v30, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v27, v27
-; GFX10-NEXT:    v_add3_u32 v50, v50, v20, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v51, v3, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v27, 0x400000, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v22, v8, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX10-NEXT:    v_bfe_u32 v22, v19, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v5, 0x400000, v19
-; GFX10-NEXT:    v_add3_u32 v51, v51, v3, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v29, v49, v29, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX10-NEXT:    v_add3_u32 v22, v22, v19, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v49, v2, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v21, 0x400000, v2
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, v26, v7, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX10-NEXT:    v_bfe_u32 v26, v18, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v4, 0x400000, v18
-; GFX10-NEXT:    v_add3_u32 v49, v49, v2, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v28, v38, v28, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX10-NEXT:    v_bfe_u32 v38, v1, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v18, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v50, v6, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
-; GFX10-NEXT:    v_bfe_u32 v50, v17, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v1, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v22, v5, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX10-NEXT:    v_bfe_u32 v22, v0, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v17, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v18, 0x400000, v0
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v26, v4, vcc_lo
+; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_bfe_u32 v18, v49, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v52, 0x400000, v49
+; GFX10-NEXT:    v_cmp_u_f32_e64 s14, v49, v49
+; GFX10-NEXT:    v_bfe_u32 v39, v1, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v1
+; GFX10-NEXT:    v_add3_u32 v18, v18, v49, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
+; GFX10-NEXT:    v_add3_u32 v39, v39, v1, 0x7fff
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v0, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v38, v20, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v19, v48, s13
+; GFX10-NEXT:    v_mul_f32_e32 v17, v49, v17
+; GFX10-NEXT:    v_mul_f32_e32 v0, v0, v16
+; GFX10-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v39, v35, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v22, v2, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v49, v17, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v17
+; GFX10-NEXT:    v_bfe_u32 v50, v0, 16, 1
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_perm_b32 v1, v1, v4, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v50, v19, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v0
+; GFX10-NEXT:    v_add3_u32 v49, v49, v17, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
+; GFX10-NEXT:    v_add3_u32 v50, v50, v0, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v23, v36, s4
+; GFX10-NEXT:    v_bfe_u32 v36, v3, 16, 1
+; GFX10-NEXT:    v_cndmask_b32_e32 v8, v49, v8, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
-; GFX10-NEXT:    v_perm_b32 v4, v28, v7, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v7, v34, v10, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v0, v22, v18, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v37, v37, v38, s7
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v2
+; GFX10-NEXT:    v_add3_u32 v22, v22, v2, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v24, v34, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v50, v48, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX10-NEXT:    v_perm_b32 v0, v0, v17, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v49, v21, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v3
+; GFX10-NEXT:    v_add3_u32 v36, v36, v3, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v18, v52, s14
+; GFX10-NEXT:    v_perm_b32 v0, v0, v8, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v22, v38, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX10-NEXT:    v_perm_b32 v2, v2, v5, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v51, v27, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v5, v29, v8, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v8, v35, v11, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v3, v3, v6, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v6, v30, v9, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v9, v39, v12, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v51, s12
+; GFX10-NEXT:    v_perm_b32 v1, v1, v18, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v9, v9, v30, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v2, v2, v19, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v36, v34, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v10, v25, v10, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v11, v26, v11, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v12, v27, v12, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v13, v28, v13, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v3, v3, v4, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v4, v5, v20, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v5, v21, v6, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v6, v37, v7, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v7, v24, v23, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v14, v29, v14, 0x7060302
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v32
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v32
-; GFX10-NEXT:    v_mul_f32_e32 v17, v31, v17
-; GFX10-NEXT:    v_mul_f32_e32 v15, v15, v18
-; GFX10-NEXT:    v_bfe_u32 v10, v17, 16, 1
-; GFX10-NEXT:    v_bfe_u32 v11, v15, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v17
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_mul_f32_e32 v17, v33, v8
+; GFX10-NEXT:    v_mul_f32_e32 v15, v15, v16
+; GFX10-NEXT:    v_perm_b32 v8, v32, v31, 0x7060302
+; GFX10-NEXT:    v_bfe_u32 v16, v17, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v18, v15, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v15
-; GFX10-NEXT:    v_add3_u32 v18, v10, v17, 0x7fff
-; GFX10-NEXT:    v_add3_u32 v11, v11, v15, 0x7fff
-; GFX10-NEXT:    v_perm_b32 v10, v37, v13, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v13, v36, v25, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v18, v12, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v15
+; GFX10-NEXT:    v_add3_u32 v16, v16, v17, 0x7fff
+; GFX10-NEXT:    v_add3_u32 v18, v18, v15, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v19, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX10-NEXT:    v_perm_b32 v12, v33, v48, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v15, v11, v19, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v11, v24, v14, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v14, v23, v16, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v15, v15, v17, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v15, v18, v20, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v15, v15, v16, 0x7060302
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11TRUE16-LABEL: v_fmul_v32bf16:
@@ -20986,38 +20986,38 @@ define <16 x bfloat> @v_minnum_v16bf16(<16 x bfloat> %a, <16 x bfloat> %b) {
 ; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v22, 16, v0
 ; GFX11TRUE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v3.l, v3.h
-; GFX11TRUE16-NEXT:    v_dual_cndmask_b32 v10, v19, v21 :: v_dual_lshlrev_b32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v10, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v19, 0x400000, v2
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11TRUE16-NEXT:    v_dual_min_f32 v9, v22, v21 :: v_dual_and_b32 v8, 0xffff0000, v8
-; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
+; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v2, v16, v19, vcc_lo
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v16, v1, 16, 1
+; GFX11TRUE16-NEXT:    v_min_f32_e32 v9, v22, v21
+; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11TRUE16-NEXT:    v_min_f32_e32 v0, v0, v8
+; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v8, v9, 16, 1
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v2.l, v2.h
-; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
-; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v1, v16, v22, vcc_lo
 ; GFX11TRUE16-NEXT:    v_min_f32_e32 v17, v24, v23
+; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v24, 0x400000, v9
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v0
-; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v17, 16, 1
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v21, 0x400000, v17
+; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v8, v8, v24, vcc_lo
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_add3_u32 v19, v23, v17, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v0, 16, 1
+; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v8.l, v8.h
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v9, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_add3_u32 v16, v23, v0, 0x7fff
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
@@ -22330,280 +22330,280 @@ define <32 x bfloat> @v_minnum_v32bf16(<32 x bfloat> %a, <32 x bfloat> %b) {
 ; GFX10-LABEL: v_minnum_v32bf16:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v12
-; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
-; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX10-NEXT:    v_lshlrev_b32_e32 v39, 16, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v11
-; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v14
+; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v32, 16, v14
 ; GFX10-NEXT:    v_and_b32_e32 v30, 0xffff0000, v30
 ; GFX10-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v13
-; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v13
 ; GFX10-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GFX10-NEXT:    v_min_f32_e32 v12, v12, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v28, 16, v22
-; GFX10-NEXT:    v_min_f32_e32 v39, v48, v39
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v6
-; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX10-NEXT:    v_min_f32_e32 v11, v11, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v27, 16, v21
-; GFX10-NEXT:    v_min_f32_e32 v49, v50, v49
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v5
-; GFX10-NEXT:    v_min_f32_e32 v33, v34, v33
-; GFX10-NEXT:    v_min_f32_e32 v14, v14, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v30, 16, v24
-; GFX10-NEXT:    v_min_f32_e32 v35, v36, v35
+; GFX10-NEXT:    v_min_f32_e32 v31, v32, v31
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v12
+; GFX10-NEXT:    v_min_f32_e32 v30, v14, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v14, 16, v29
+; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_bfe_u32 v32, v31, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v31
+; GFX10-NEXT:    v_bfe_u32 v35, v30, 16, 1
+; GFX10-NEXT:    v_min_f32_e32 v33, v33, v14
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v31, v31
+; GFX10-NEXT:    v_add3_u32 v32, v32, v31, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX10-NEXT:    v_add3_u32 v31, v35, v30, 0x7fff
+; GFX10-NEXT:    v_min_f32_e32 v35, v13, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v13, 16, v28
+; GFX10-NEXT:    v_cndmask_b32_e32 v14, v32, v34, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v30
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
+; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v21
+; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v5
+; GFX10-NEXT:    v_add3_u32 v30, v34, v33, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v29, v31, v32, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_min_f32_e32 v34, v36, v13
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_min_f32_e32 v33, v12, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v12, 16, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v11
+; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
+; GFX10-NEXT:    v_cndmask_b32_e32 v13, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_min_f32_e32 v35, v36, v12
+; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v10
+; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX10-NEXT:    v_cndmask_b32_e32 v28, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_min_f32_e32 v34, v11, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v11, 16, v26
+; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
+; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
+; GFX10-NEXT:    v_cndmask_b32_e32 v12, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_min_f32_e32 v33, v36, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v9
+; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v3
+; GFX10-NEXT:    v_cndmask_b32_e32 v27, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_min_f32_e32 v35, v10, v26
+; GFX10-NEXT:    v_lshlrev_b32_e32 v10, 16, v25
+; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
+; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v2
+; GFX10-NEXT:    v_cndmask_b32_e32 v11, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_min_f32_e32 v34, v36, v10
+; GFX10-NEXT:    v_min_f32_e32 v9, v9, v25
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v8
-; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
 ; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX10-NEXT:    v_min_f32_e32 v13, v13, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v29, 16, v23
-; GFX10-NEXT:    v_min_f32_e32 v37, v38, v37
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v26, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v24
+; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_lshlrev_b32_e32 v52, 16, v1
+; GFX10-NEXT:    v_cndmask_b32_e32 v10, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_min_f32_e32 v33, v36, v33
+; GFX10-NEXT:    v_min_f32_e32 v8, v8, v24
+; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v23
+; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v25, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v9, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
 ; GFX10-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GFX10-NEXT:    v_min_f32_e32 v6, v6, v22
-; GFX10-NEXT:    v_lshlrev_b32_e32 v22, 16, v16
-; GFX10-NEXT:    v_min_f32_e32 v27, v50, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v0
-; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
-; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
-; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v9
-; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GFX10-NEXT:    v_min_f32_e32 v8, v8, v24
-; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v18
-; GFX10-NEXT:    v_min_f32_e32 v29, v38, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v2
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_min_f32_e32 v24, v35, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v30, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v32, v9, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v9
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX10-NEXT:    v_min_f32_e32 v7, v7, v23
-; GFX10-NEXT:    v_lshlrev_b32_e32 v23, 16, v17
-; GFX10-NEXT:    v_min_f32_e32 v28, v48, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v1
-; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_bfe_u32 v23, v24, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v36, 0x400000, v24
+; GFX10-NEXT:    v_cmp_u_f32_e64 s4, v24, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v9, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v34, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v34, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v8
+; GFX10-NEXT:    v_bfe_u32 v35, v7, 16, 1
+; GFX10-NEXT:    v_add3_u32 v23, v23, v24, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e64 s5, v7, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v31, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v32, v34, v8, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v22
+; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
+; GFX10-NEXT:    v_add3_u32 v24, v35, v7, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_min_f32_e32 v8, v34, v8
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
 ; GFX10-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX10-NEXT:    v_min_f32_e32 v0, v0, v16
-; GFX10-NEXT:    v_bfe_u32 v16, v33, 16, 1
-; GFX10-NEXT:    v_min_f32_e32 v10, v10, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v26, 16, v20
-; GFX10-NEXT:    v_min_f32_e32 v34, v34, v51
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v4
+; GFX10-NEXT:    v_min_f32_e32 v6, v6, v22
+; GFX10-NEXT:    v_cndmask_b32_e32 v32, v32, v33, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s6, v8, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s7, v6, v6
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v15
+; GFX10-NEXT:    v_add3_u32 v7, v35, v8, 0x7fff
+; GFX10-NEXT:    v_min_f32_e32 v35, v38, v37
+; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v21
+; GFX10-NEXT:    v_bfe_u32 v37, v6, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v6
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, v22, s6
+; GFX10-NEXT:    v_bfe_u32 v21, v35, 16, 1
+; GFX10-NEXT:    v_min_f32_e32 v5, v5, v8
+; GFX10-NEXT:    v_add3_u32 v37, v37, v6, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v20
 ; GFX10-NEXT:    v_and_b32_e32 v20, 0xffff0000, v20
+; GFX10-NEXT:    v_add3_u32 v6, v21, v35, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v21, 16, v4
+; GFX10-NEXT:    v_bfe_u32 v48, v5, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX10-NEXT:    v_min_f32_e32 v9, v9, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v25, 16, v19
-; GFX10-NEXT:    v_min_f32_e32 v30, v36, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v3
+; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v35
+; GFX10-NEXT:    v_cmp_u_f32_e64 s8, v35, v35
+; GFX10-NEXT:    v_min_f32_e32 v8, v21, v8
+; GFX10-NEXT:    v_add3_u32 v21, v48, v5, 0x7fff
+; GFX10-NEXT:    v_min_f32_e32 v4, v4, v20
+; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v19
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v5
+; GFX10-NEXT:    v_bfe_u32 v20, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s9, v5, v5
+; GFX10-NEXT:    v_bfe_u32 v5, v4, 16, 1
+; GFX10-NEXT:    v_min_f32_e32 v48, v49, v48
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v18
+; GFX10-NEXT:    v_add3_u32 v20, v20, v8, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s10, v8, v8
+; GFX10-NEXT:    v_add3_u32 v5, v5, v4, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v4
+; GFX10-NEXT:    v_cmp_u_f32_e64 s11, v4, v4
+; GFX10-NEXT:    v_bfe_u32 v4, v48, 16, 1
+; GFX10-NEXT:    v_min_f32_e32 v49, v51, v49
+; GFX10-NEXT:    v_or_b32_e32 v51, 0x400000, v48
+; GFX10-NEXT:    v_cmp_u_f32_e64 s12, v48, v48
 ; GFX10-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_add3_u32 v4, v4, v48, 0x7fff
+; GFX10-NEXT:    v_bfe_u32 v48, v49, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s13, v49, v49
+; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
+; GFX10-NEXT:    v_min_f32_e32 v3, v3, v19
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, v39, s8
+; GFX10-NEXT:    v_add3_u32 v19, v48, v49, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v49
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v17
 ; GFX10-NEXT:    v_min_f32_e32 v2, v2, v18
-; GFX10-NEXT:    v_min_f32_e32 v18, v48, v23
+; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v21, v35, s9
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v20, v50, s10
+; GFX10-NEXT:    v_min_f32_e32 v49, v52, v49
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v5, v8, s11
 ; GFX10-NEXT:    v_min_f32_e32 v1, v1, v17
-; GFX10-NEXT:    v_min_f32_e32 v17, v50, v22
-; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v33
-; GFX10-NEXT:    v_bfe_u32 v23, v14, 16, 1
-; GFX10-NEXT:    v_add3_u32 v16, v16, v33, 0x7fff
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX10-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
-; GFX10-NEXT:    v_min_f32_e32 v4, v4, v20
-; GFX10-NEXT:    v_min_f32_e32 v20, v36, v25
-; GFX10-NEXT:    v_min_f32_e32 v3, v3, v19
-; GFX10-NEXT:    v_min_f32_e32 v19, v38, v24
-; GFX10-NEXT:    v_or_b32_e32 v24, 0x400000, v14
-; GFX10-NEXT:    v_bfe_u32 v25, v35, 16, 1
-; GFX10-NEXT:    v_add3_u32 v23, v23, v14, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v22, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX10-NEXT:    v_min_f32_e32 v5, v5, v21
-; GFX10-NEXT:    v_min_f32_e32 v21, v51, v26
-; GFX10-NEXT:    v_or_b32_e32 v26, 0x400000, v35
-; GFX10-NEXT:    v_bfe_u32 v36, v13, 16, 1
-; GFX10-NEXT:    v_add3_u32 v25, v25, v35, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v23, v23, v24, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v13
-; GFX10-NEXT:    v_bfe_u32 v48, v37, 16, 1
-; GFX10-NEXT:    v_add3_u32 v36, v36, v13, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v37
-; GFX10-NEXT:    v_cndmask_b32_e32 v25, v25, v26, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
-; GFX10-NEXT:    v_bfe_u32 v51, v12, 16, 1
-; GFX10-NEXT:    v_add3_u32 v48, v48, v37, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v12
-; GFX10-NEXT:    v_bfe_u32 v22, v39, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v36, v36, v38, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX10-NEXT:    v_add3_u32 v51, v51, v12, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v14, 0x400000, v39
-; GFX10-NEXT:    v_bfe_u32 v24, v11, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v39, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v48, v48, v50, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v11
-; GFX10-NEXT:    v_bfe_u32 v26, v49, 16, 1
-; GFX10-NEXT:    v_add3_u32 v24, v24, v11, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v13, 0x400000, v49
-; GFX10-NEXT:    v_cndmask_b32_e32 v33, v51, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v39, v39
-; GFX10-NEXT:    v_bfe_u32 v38, v10, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v49, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v37, 0x400000, v10
-; GFX10-NEXT:    v_bfe_u32 v50, v34, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, v22, v14, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v11, v11
-; GFX10-NEXT:    v_add3_u32 v38, v38, v10, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v34
-; GFX10-NEXT:    v_bfe_u32 v51, v9, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v34, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v24, v24, v35, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v49, v49
-; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v9
-; GFX10-NEXT:    v_bfe_u32 v22, v30, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v9, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v11, 0x400000, v30
-; GFX10-NEXT:    v_cndmask_b32_e32 v13, v26, v13, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v10, v10
-; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v30, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v49, 0x400000, v8
-; GFX10-NEXT:    v_bfe_u32 v26, v29, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v37, v38, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX10-NEXT:    v_add3_u32 v35, v35, v8, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v10, 0x400000, v29
-; GFX10-NEXT:    v_bfe_u32 v38, v7, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v29, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v12, v50, v12, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
-; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
-; GFX10-NEXT:    v_bfe_u32 v50, v28, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v7, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v9, 0x400000, v28
-; GFX10-NEXT:    v_cndmask_b32_e32 v39, v51, v39, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
-; GFX10-NEXT:    v_bfe_u32 v51, v6, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v28, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v30, 0x400000, v6
-; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, v22, v11, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
-; GFX10-NEXT:    v_bfe_u32 v22, v27, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v6, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v27
-; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v35, v35, v49, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v29, v29
-; GFX10-NEXT:    v_bfe_u32 v49, v5, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v27, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v29, 0x400000, v5
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v26, v10, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX10-NEXT:    v_bfe_u32 v26, v21, 16, 1
-; GFX10-NEXT:    v_add3_u32 v49, v49, v5, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v7, 0x400000, v21
-; GFX10-NEXT:    v_cndmask_b32_e32 v34, v38, v34, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v28, v28
-; GFX10-NEXT:    v_bfe_u32 v38, v4, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v21, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v28, 0x400000, v4
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v50, v9, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX10-NEXT:    v_bfe_u32 v50, v20, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v4, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v6, 0x400000, v20
-; GFX10-NEXT:    v_cndmask_b32_e32 v30, v51, v30, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v27, v27
-; GFX10-NEXT:    v_add3_u32 v50, v50, v20, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v51, v3, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v27, 0x400000, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v22, v8, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX10-NEXT:    v_bfe_u32 v22, v19, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v5, 0x400000, v19
-; GFX10-NEXT:    v_add3_u32 v51, v51, v3, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v29, v49, v29, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX10-NEXT:    v_add3_u32 v22, v22, v19, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v49, v2, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v21, 0x400000, v2
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, v26, v7, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX10-NEXT:    v_bfe_u32 v26, v18, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v4, 0x400000, v18
-; GFX10-NEXT:    v_add3_u32 v49, v49, v2, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v28, v38, v28, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX10-NEXT:    v_bfe_u32 v38, v1, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v18, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v50, v6, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
-; GFX10-NEXT:    v_bfe_u32 v50, v17, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v1, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v22, v5, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX10-NEXT:    v_bfe_u32 v22, v0, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v17, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v18, 0x400000, v0
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v26, v4, vcc_lo
+; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_bfe_u32 v18, v49, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v52, 0x400000, v49
+; GFX10-NEXT:    v_cmp_u_f32_e64 s14, v49, v49
+; GFX10-NEXT:    v_bfe_u32 v39, v1, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v1
+; GFX10-NEXT:    v_add3_u32 v18, v18, v49, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
+; GFX10-NEXT:    v_add3_u32 v39, v39, v1, 0x7fff
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v0, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v38, v20, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v19, v48, s13
+; GFX10-NEXT:    v_min_f32_e32 v17, v49, v17
+; GFX10-NEXT:    v_min_f32_e32 v0, v0, v16
+; GFX10-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v39, v35, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v22, v2, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v49, v17, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v17
+; GFX10-NEXT:    v_bfe_u32 v50, v0, 16, 1
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_perm_b32 v1, v1, v4, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v50, v19, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v0
+; GFX10-NEXT:    v_add3_u32 v49, v49, v17, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
+; GFX10-NEXT:    v_add3_u32 v50, v50, v0, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v23, v36, s4
+; GFX10-NEXT:    v_bfe_u32 v36, v3, 16, 1
+; GFX10-NEXT:    v_cndmask_b32_e32 v8, v49, v8, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
-; GFX10-NEXT:    v_perm_b32 v4, v28, v7, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v7, v34, v10, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v0, v22, v18, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v37, v37, v38, s7
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v2
+; GFX10-NEXT:    v_add3_u32 v22, v22, v2, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v24, v34, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v50, v48, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX10-NEXT:    v_perm_b32 v0, v0, v17, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v49, v21, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v3
+; GFX10-NEXT:    v_add3_u32 v36, v36, v3, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v18, v52, s14
+; GFX10-NEXT:    v_perm_b32 v0, v0, v8, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v22, v38, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX10-NEXT:    v_perm_b32 v2, v2, v5, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v51, v27, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v5, v29, v8, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v8, v35, v11, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v3, v3, v6, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v6, v30, v9, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v9, v39, v12, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v51, s12
+; GFX10-NEXT:    v_perm_b32 v1, v1, v18, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v9, v9, v30, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v2, v2, v19, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v36, v34, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v10, v25, v10, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v11, v26, v11, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v12, v27, v12, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v13, v28, v13, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v3, v3, v4, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v4, v5, v20, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v5, v21, v6, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v6, v37, v7, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v7, v24, v23, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v14, v29, v14, 0x7060302
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v32
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v32
-; GFX10-NEXT:    v_min_f32_e32 v17, v31, v17
-; GFX10-NEXT:    v_min_f32_e32 v15, v15, v18
-; GFX10-NEXT:    v_bfe_u32 v10, v17, 16, 1
-; GFX10-NEXT:    v_bfe_u32 v11, v15, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v17
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_min_f32_e32 v17, v33, v8
+; GFX10-NEXT:    v_min_f32_e32 v15, v15, v16
+; GFX10-NEXT:    v_perm_b32 v8, v32, v31, 0x7060302
+; GFX10-NEXT:    v_bfe_u32 v16, v17, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v18, v15, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v15
-; GFX10-NEXT:    v_add3_u32 v18, v10, v17, 0x7fff
-; GFX10-NEXT:    v_add3_u32 v11, v11, v15, 0x7fff
-; GFX10-NEXT:    v_perm_b32 v10, v37, v13, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v13, v36, v25, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v18, v12, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v15
+; GFX10-NEXT:    v_add3_u32 v16, v16, v17, 0x7fff
+; GFX10-NEXT:    v_add3_u32 v18, v18, v15, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v19, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX10-NEXT:    v_perm_b32 v12, v33, v48, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v15, v11, v19, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v11, v24, v14, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v14, v23, v16, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v15, v15, v17, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v15, v18, v20, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v15, v15, v16, 0x7060302
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11TRUE16-LABEL: v_minnum_v32bf16:
@@ -25238,38 +25238,38 @@ define <16 x bfloat> @v_maxnum_v16bf16(<16 x bfloat> %a, <16 x bfloat> %b) {
 ; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v22, 16, v0
 ; GFX11TRUE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v3.l, v3.h
-; GFX11TRUE16-NEXT:    v_dual_cndmask_b32 v10, v19, v21 :: v_dual_lshlrev_b32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v10, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v19, 0x400000, v2
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11TRUE16-NEXT:    v_dual_max_f32 v9, v22, v21 :: v_dual_and_b32 v8, 0xffff0000, v8
-; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
+; GFX11TRUE16-NEXT:    v_lshlrev_b32_e32 v21, 16, v8
+; GFX11TRUE16-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v2, v16, v19, vcc_lo
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v16, v1, 16, 1
+; GFX11TRUE16-NEXT:    v_max_f32_e32 v9, v22, v21
+; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v1
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11TRUE16-NEXT:    v_max_f32_e32 v0, v0, v8
+; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v8, v9, 16, 1
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v2.l, v2.h
-; GFX11TRUE16-NEXT:    v_add3_u32 v16, v16, v1, 0x7fff
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
-; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v1, v16, v22, vcc_lo
 ; GFX11TRUE16-NEXT:    v_max_f32_e32 v17, v24, v23
+; GFX11TRUE16-NEXT:    v_add3_u32 v8, v8, v9, 0x7fff
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v24, 0x400000, v9
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v22, 0x400000, v0
-; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v17, 16, 1
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v21, 0x400000, v17
+; GFX11TRUE16-NEXT:    v_mov_b16_e32 v1.l, v1.h
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v8, v8, v24, vcc_lo
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_add3_u32 v19, v23, v17, 0x7fff
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v23, v0, 16, 1
+; GFX11TRUE16-NEXT:    v_bfi_b32 v2, 0xffff, v2, v10
 ; GFX11TRUE16-NEXT:    v_mov_b16_e32 v8.l, v8.h
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
 ; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v9, v19, v21, vcc_lo
 ; GFX11TRUE16-NEXT:    v_add3_u32 v16, v23, v0, 0x7fff
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
@@ -26582,280 +26582,280 @@ define <32 x bfloat> @v_maxnum_v32bf16(<32 x bfloat> %a, <32 x bfloat> %b) {
 ; GFX10-LABEL: v_maxnum_v32bf16:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32
-; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v12
-; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
-; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX10-NEXT:    v_lshlrev_b32_e32 v39, 16, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v11
-; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
-; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v14
+; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v32, 16, v14
 ; GFX10-NEXT:    v_and_b32_e32 v30, 0xffff0000, v30
 ; GFX10-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v13
-; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v13
 ; GFX10-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GFX10-NEXT:    v_max_f32_e32 v12, v12, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v28, 16, v22
-; GFX10-NEXT:    v_max_f32_e32 v39, v48, v39
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v6
-; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
-; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX10-NEXT:    v_max_f32_e32 v11, v11, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v27, 16, v21
-; GFX10-NEXT:    v_max_f32_e32 v49, v50, v49
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v5
-; GFX10-NEXT:    v_max_f32_e32 v33, v34, v33
-; GFX10-NEXT:    v_max_f32_e32 v14, v14, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v30, 16, v24
-; GFX10-NEXT:    v_max_f32_e32 v35, v36, v35
+; GFX10-NEXT:    v_max_f32_e32 v31, v32, v31
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v12
+; GFX10-NEXT:    v_max_f32_e32 v30, v14, v30
+; GFX10-NEXT:    v_lshlrev_b32_e32 v14, 16, v29
+; GFX10-NEXT:    v_and_b32_e32 v29, 0xffff0000, v29
+; GFX10-NEXT:    v_bfe_u32 v32, v31, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v31
+; GFX10-NEXT:    v_bfe_u32 v35, v30, 16, 1
+; GFX10-NEXT:    v_max_f32_e32 v33, v33, v14
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v31, v31
+; GFX10-NEXT:    v_add3_u32 v32, v32, v31, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX10-NEXT:    v_add3_u32 v31, v35, v30, 0x7fff
+; GFX10-NEXT:    v_max_f32_e32 v35, v13, v29
+; GFX10-NEXT:    v_lshlrev_b32_e32 v13, 16, v28
+; GFX10-NEXT:    v_cndmask_b32_e32 v14, v32, v34, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v30
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
+; GFX10-NEXT:    v_and_b32_e32 v28, 0xffff0000, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v37, 16, v21
+; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v5
+; GFX10-NEXT:    v_add3_u32 v30, v34, v33, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v29, v31, v32, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_max_f32_e32 v34, v36, v13
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_max_f32_e32 v33, v12, v28
+; GFX10-NEXT:    v_lshlrev_b32_e32 v12, 16, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v11
+; GFX10-NEXT:    v_and_b32_e32 v27, 0xffff0000, v27
+; GFX10-NEXT:    v_cndmask_b32_e32 v13, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_max_f32_e32 v35, v36, v12
+; GFX10-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v10
+; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX10-NEXT:    v_cndmask_b32_e32 v28, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_max_f32_e32 v34, v11, v27
+; GFX10-NEXT:    v_lshlrev_b32_e32 v11, 16, v26
+; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
+; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
+; GFX10-NEXT:    v_cndmask_b32_e32 v12, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_max_f32_e32 v33, v36, v11
+; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v9
+; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v3
+; GFX10-NEXT:    v_cndmask_b32_e32 v27, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_max_f32_e32 v35, v10, v26
+; GFX10-NEXT:    v_lshlrev_b32_e32 v10, 16, v25
+; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
+; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v2
+; GFX10-NEXT:    v_cndmask_b32_e32 v11, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v33, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_max_f32_e32 v34, v36, v10
+; GFX10-NEXT:    v_max_f32_e32 v9, v9, v25
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v8
-; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
 ; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX10-NEXT:    v_max_f32_e32 v13, v13, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v29, 16, v23
-; GFX10-NEXT:    v_max_f32_e32 v37, v38, v37
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v26, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v32, v35, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v24
+; GFX10-NEXT:    v_and_b32_e32 v24, 0xffff0000, v24
+; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_lshlrev_b32_e32 v52, 16, v1
+; GFX10-NEXT:    v_cndmask_b32_e32 v10, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v35, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v35
+; GFX10-NEXT:    v_bfe_u32 v32, v34, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX10-NEXT:    v_max_f32_e32 v33, v36, v33
+; GFX10-NEXT:    v_max_f32_e32 v8, v8, v24
+; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v23
+; GFX10-NEXT:    v_lshlrev_b32_e32 v35, 16, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v25, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v30, v32, v34, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v31, 0x400000, v34
+; GFX10-NEXT:    v_bfe_u32 v32, v9, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX10-NEXT:    v_bfe_u32 v34, v33, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v23, 0xffff0000, v23
 ; GFX10-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GFX10-NEXT:    v_max_f32_e32 v6, v6, v22
-; GFX10-NEXT:    v_lshlrev_b32_e32 v22, 16, v16
-; GFX10-NEXT:    v_max_f32_e32 v27, v50, v27
-; GFX10-NEXT:    v_lshlrev_b32_e32 v50, 16, v0
-; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
-; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
-; GFX10-NEXT:    v_and_b32_e32 v26, 0xffff0000, v26
-; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v9
-; GFX10-NEXT:    v_and_b32_e32 v25, 0xffff0000, v25
-; GFX10-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GFX10-NEXT:    v_max_f32_e32 v8, v8, v24
-; GFX10-NEXT:    v_lshlrev_b32_e32 v24, 16, v18
-; GFX10-NEXT:    v_max_f32_e32 v29, v38, v29
-; GFX10-NEXT:    v_lshlrev_b32_e32 v38, 16, v2
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
-; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_max_f32_e32 v24, v35, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v30, v30, v31, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v32, v9, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v9
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
 ; GFX10-NEXT:    v_max_f32_e32 v7, v7, v23
-; GFX10-NEXT:    v_lshlrev_b32_e32 v23, 16, v17
-; GFX10-NEXT:    v_max_f32_e32 v28, v48, v28
-; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v1
-; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_bfe_u32 v23, v24, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v36, 0x400000, v24
+; GFX10-NEXT:    v_cmp_u_f32_e64 s4, v24, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v9, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v31, v34, v33, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v32, 0x400000, v33
+; GFX10-NEXT:    v_bfe_u32 v34, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v8
+; GFX10-NEXT:    v_bfe_u32 v35, v7, 16, 1
+; GFX10-NEXT:    v_add3_u32 v23, v23, v24, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e64 s5, v7, v7
+; GFX10-NEXT:    v_cndmask_b32_e32 v31, v31, v32, vcc_lo
+; GFX10-NEXT:    v_add3_u32 v32, v34, v8, 0x7fff
+; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v22
+; GFX10-NEXT:    v_lshlrev_b32_e32 v34, 16, v6
+; GFX10-NEXT:    v_add3_u32 v24, v35, v7, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v22, 0xffff0000, v22
+; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX10-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX10-NEXT:    v_max_f32_e32 v8, v34, v8
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
 ; GFX10-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX10-NEXT:    v_max_f32_e32 v0, v0, v16
-; GFX10-NEXT:    v_bfe_u32 v16, v33, 16, 1
-; GFX10-NEXT:    v_max_f32_e32 v10, v10, v26
-; GFX10-NEXT:    v_lshlrev_b32_e32 v26, 16, v20
-; GFX10-NEXT:    v_max_f32_e32 v34, v34, v51
-; GFX10-NEXT:    v_lshlrev_b32_e32 v51, 16, v4
+; GFX10-NEXT:    v_max_f32_e32 v6, v6, v22
+; GFX10-NEXT:    v_cndmask_b32_e32 v32, v32, v33, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s6, v8, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s7, v6, v6
+; GFX10-NEXT:    v_lshlrev_b32_e32 v33, 16, v15
+; GFX10-NEXT:    v_add3_u32 v7, v35, v8, 0x7fff
+; GFX10-NEXT:    v_max_f32_e32 v35, v38, v37
+; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff0000, v21
+; GFX10-NEXT:    v_bfe_u32 v37, v6, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v6
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v7, v22, s6
+; GFX10-NEXT:    v_bfe_u32 v21, v35, 16, 1
+; GFX10-NEXT:    v_max_f32_e32 v5, v5, v8
+; GFX10-NEXT:    v_add3_u32 v37, v37, v6, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v20
 ; GFX10-NEXT:    v_and_b32_e32 v20, 0xffff0000, v20
+; GFX10-NEXT:    v_add3_u32 v6, v21, v35, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v21, 16, v4
+; GFX10-NEXT:    v_bfe_u32 v48, v5, 16, 1
 ; GFX10-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX10-NEXT:    v_max_f32_e32 v9, v9, v25
-; GFX10-NEXT:    v_lshlrev_b32_e32 v25, 16, v19
-; GFX10-NEXT:    v_max_f32_e32 v30, v36, v30
-; GFX10-NEXT:    v_lshlrev_b32_e32 v36, 16, v3
+; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v35
+; GFX10-NEXT:    v_cmp_u_f32_e64 s8, v35, v35
+; GFX10-NEXT:    v_max_f32_e32 v8, v21, v8
+; GFX10-NEXT:    v_add3_u32 v21, v48, v5, 0x7fff
+; GFX10-NEXT:    v_max_f32_e32 v4, v4, v20
+; GFX10-NEXT:    v_lshlrev_b32_e32 v48, 16, v19
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v5
+; GFX10-NEXT:    v_bfe_u32 v20, v8, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s9, v5, v5
+; GFX10-NEXT:    v_bfe_u32 v5, v4, 16, 1
+; GFX10-NEXT:    v_max_f32_e32 v48, v49, v48
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v18
+; GFX10-NEXT:    v_add3_u32 v20, v20, v8, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v8
+; GFX10-NEXT:    v_cmp_u_f32_e64 s10, v8, v8
+; GFX10-NEXT:    v_add3_u32 v5, v5, v4, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v4
+; GFX10-NEXT:    v_cmp_u_f32_e64 s11, v4, v4
+; GFX10-NEXT:    v_bfe_u32 v4, v48, 16, 1
+; GFX10-NEXT:    v_max_f32_e32 v49, v51, v49
+; GFX10-NEXT:    v_or_b32_e32 v51, 0x400000, v48
+; GFX10-NEXT:    v_cmp_u_f32_e64 s12, v48, v48
 ; GFX10-NEXT:    v_and_b32_e32 v19, 0xffff0000, v19
-; GFX10-NEXT:    v_and_b32_e32 v3, 0xffff0000, v3
+; GFX10-NEXT:    v_add3_u32 v4, v4, v48, 0x7fff
+; GFX10-NEXT:    v_bfe_u32 v48, v49, 16, 1
+; GFX10-NEXT:    v_cmp_u_f32_e64 s13, v49, v49
+; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v18
+; GFX10-NEXT:    v_max_f32_e32 v3, v3, v19
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, v39, s8
+; GFX10-NEXT:    v_add3_u32 v19, v48, v49, 0x7fff
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v49
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v17
 ; GFX10-NEXT:    v_max_f32_e32 v2, v2, v18
-; GFX10-NEXT:    v_max_f32_e32 v18, v48, v23
+; GFX10-NEXT:    v_and_b32_e32 v17, 0xffff0000, v17
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v21, v35, s9
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v20, v50, s10
+; GFX10-NEXT:    v_max_f32_e32 v49, v52, v49
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v5, v8, s11
 ; GFX10-NEXT:    v_max_f32_e32 v1, v1, v17
-; GFX10-NEXT:    v_max_f32_e32 v17, v50, v22
-; GFX10-NEXT:    v_or_b32_e32 v22, 0x400000, v33
-; GFX10-NEXT:    v_bfe_u32 v23, v14, 16, 1
-; GFX10-NEXT:    v_add3_u32 v16, v16, v33, 0x7fff
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX10-NEXT:    v_and_b32_e32 v21, 0xffff0000, v21
-; GFX10-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
-; GFX10-NEXT:    v_max_f32_e32 v4, v4, v20
-; GFX10-NEXT:    v_max_f32_e32 v20, v36, v25
-; GFX10-NEXT:    v_max_f32_e32 v3, v3, v19
-; GFX10-NEXT:    v_max_f32_e32 v19, v38, v24
-; GFX10-NEXT:    v_or_b32_e32 v24, 0x400000, v14
-; GFX10-NEXT:    v_bfe_u32 v25, v35, 16, 1
-; GFX10-NEXT:    v_add3_u32 v23, v23, v14, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v22, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX10-NEXT:    v_max_f32_e32 v5, v5, v21
-; GFX10-NEXT:    v_max_f32_e32 v21, v51, v26
-; GFX10-NEXT:    v_or_b32_e32 v26, 0x400000, v35
-; GFX10-NEXT:    v_bfe_u32 v36, v13, 16, 1
-; GFX10-NEXT:    v_add3_u32 v25, v25, v35, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v23, v23, v24, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v13
-; GFX10-NEXT:    v_bfe_u32 v48, v37, 16, 1
-; GFX10-NEXT:    v_add3_u32 v36, v36, v13, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v50, 0x400000, v37
-; GFX10-NEXT:    v_cndmask_b32_e32 v25, v25, v26, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v13, v13
-; GFX10-NEXT:    v_bfe_u32 v51, v12, 16, 1
-; GFX10-NEXT:    v_add3_u32 v48, v48, v37, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v33, 0x400000, v12
-; GFX10-NEXT:    v_bfe_u32 v22, v39, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v36, v36, v38, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX10-NEXT:    v_add3_u32 v51, v51, v12, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v14, 0x400000, v39
-; GFX10-NEXT:    v_bfe_u32 v24, v11, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v39, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v48, v48, v50, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v11
-; GFX10-NEXT:    v_bfe_u32 v26, v49, 16, 1
-; GFX10-NEXT:    v_add3_u32 v24, v24, v11, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v13, 0x400000, v49
-; GFX10-NEXT:    v_cndmask_b32_e32 v33, v51, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v39, v39
-; GFX10-NEXT:    v_bfe_u32 v38, v10, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v49, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v37, 0x400000, v10
-; GFX10-NEXT:    v_bfe_u32 v50, v34, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, v22, v14, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v11, v11
-; GFX10-NEXT:    v_add3_u32 v38, v38, v10, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v34
-; GFX10-NEXT:    v_bfe_u32 v51, v9, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v34, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v24, v24, v35, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v49, v49
-; GFX10-NEXT:    v_or_b32_e32 v39, 0x400000, v9
-; GFX10-NEXT:    v_bfe_u32 v22, v30, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v9, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v11, 0x400000, v30
-; GFX10-NEXT:    v_cndmask_b32_e32 v13, v26, v13, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v10, v10
-; GFX10-NEXT:    v_bfe_u32 v35, v8, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v30, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v49, 0x400000, v8
-; GFX10-NEXT:    v_bfe_u32 v26, v29, 16, 1
-; GFX10-NEXT:    v_cndmask_b32_e32 v37, v38, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX10-NEXT:    v_add3_u32 v35, v35, v8, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v10, 0x400000, v29
-; GFX10-NEXT:    v_bfe_u32 v38, v7, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v29, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v12, v50, v12, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v9, v9
-; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v7
-; GFX10-NEXT:    v_bfe_u32 v50, v28, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v7, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v9, 0x400000, v28
-; GFX10-NEXT:    v_cndmask_b32_e32 v39, v51, v39, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v30, v30
-; GFX10-NEXT:    v_bfe_u32 v51, v6, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v28, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v30, 0x400000, v6
-; GFX10-NEXT:    v_lshlrev_b32_e32 v31, 16, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, v22, v11, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v8, v8
-; GFX10-NEXT:    v_bfe_u32 v22, v27, 16, 1
-; GFX10-NEXT:    v_add3_u32 v51, v51, v6, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v27
-; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v35, v35, v49, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v29, v29
-; GFX10-NEXT:    v_bfe_u32 v49, v5, 16, 1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v27, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v29, 0x400000, v5
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v26, v10, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX10-NEXT:    v_bfe_u32 v26, v21, 16, 1
-; GFX10-NEXT:    v_add3_u32 v49, v49, v5, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v7, 0x400000, v21
-; GFX10-NEXT:    v_cndmask_b32_e32 v34, v38, v34, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v28, v28
-; GFX10-NEXT:    v_bfe_u32 v38, v4, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v21, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v28, 0x400000, v4
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v50, v9, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX10-NEXT:    v_bfe_u32 v50, v20, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v4, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v6, 0x400000, v20
-; GFX10-NEXT:    v_cndmask_b32_e32 v30, v51, v30, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v27, v27
-; GFX10-NEXT:    v_add3_u32 v50, v50, v20, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v51, v3, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v27, 0x400000, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v22, v8, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX10-NEXT:    v_bfe_u32 v22, v19, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v5, 0x400000, v19
-; GFX10-NEXT:    v_add3_u32 v51, v51, v3, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v29, v49, v29, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX10-NEXT:    v_add3_u32 v22, v22, v19, 0x7fff
-; GFX10-NEXT:    v_bfe_u32 v49, v2, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v21, 0x400000, v2
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, v26, v7, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX10-NEXT:    v_bfe_u32 v26, v18, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v4, 0x400000, v18
-; GFX10-NEXT:    v_add3_u32 v49, v49, v2, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v28, v38, v28, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX10-NEXT:    v_bfe_u32 v38, v1, 16, 1
-; GFX10-NEXT:    v_add3_u32 v26, v26, v18, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v50, v6, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v19, v19
-; GFX10-NEXT:    v_bfe_u32 v50, v17, 16, 1
-; GFX10-NEXT:    v_add3_u32 v38, v38, v1, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v22, v5, vcc_lo
-; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX10-NEXT:    v_bfe_u32 v22, v0, 16, 1
-; GFX10-NEXT:    v_add3_u32 v50, v50, v17, 0x7fff
-; GFX10-NEXT:    v_or_b32_e32 v18, 0x400000, v0
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v26, v4, vcc_lo
+; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_bfe_u32 v18, v49, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v52, 0x400000, v49
+; GFX10-NEXT:    v_cmp_u_f32_e64 s14, v49, v49
+; GFX10-NEXT:    v_bfe_u32 v39, v1, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v35, 0x400000, v1
+; GFX10-NEXT:    v_add3_u32 v18, v18, v49, 0x7fff
+; GFX10-NEXT:    v_lshlrev_b32_e32 v49, 16, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
+; GFX10-NEXT:    v_add3_u32 v39, v39, v1, 0x7fff
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
-; GFX10-NEXT:    v_add3_u32 v22, v22, v0, 0x7fff
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v38, v20, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v19, v48, s13
+; GFX10-NEXT:    v_max_f32_e32 v17, v49, v17
+; GFX10-NEXT:    v_max_f32_e32 v0, v0, v16
+; GFX10-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v39, v35, vcc_lo
+; GFX10-NEXT:    v_bfe_u32 v22, v2, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v49, v17, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v8, 0x400000, v17
+; GFX10-NEXT:    v_bfe_u32 v50, v0, 16, 1
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_perm_b32 v1, v1, v4, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v50, v19, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v48, 0x400000, v0
+; GFX10-NEXT:    v_add3_u32 v49, v49, v17, 0x7fff
+; GFX10-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
+; GFX10-NEXT:    v_add3_u32 v50, v50, v0, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v23, v36, s4
+; GFX10-NEXT:    v_bfe_u32 v36, v3, 16, 1
+; GFX10-NEXT:    v_cndmask_b32_e32 v8, v49, v8, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
-; GFX10-NEXT:    v_perm_b32 v4, v28, v7, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v7, v34, v10, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v0, v22, v18, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v37, v37, v38, s7
+; GFX10-NEXT:    v_or_b32_e32 v38, 0x400000, v2
+; GFX10-NEXT:    v_add3_u32 v22, v22, v2, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v24, v34, s5
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v50, v48, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX10-NEXT:    v_perm_b32 v0, v0, v17, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v49, v21, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v34, 0x400000, v3
+; GFX10-NEXT:    v_add3_u32 v36, v36, v3, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v18, v52, s14
+; GFX10-NEXT:    v_perm_b32 v0, v0, v8, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v2, v22, v38, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX10-NEXT:    v_perm_b32 v2, v2, v5, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v51, v27, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v5, v29, v8, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v8, v35, v11, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v3, v3, v6, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v6, v30, v9, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v9, v39, v12, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v51, s12
+; GFX10-NEXT:    v_perm_b32 v1, v1, v18, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v9, v9, v30, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v2, v2, v19, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v36, v34, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v10, v25, v10, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v11, v26, v11, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v12, v27, v12, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v13, v28, v13, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v3, v3, v4, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v4, v5, v20, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v5, v21, v6, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v6, v37, v7, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v7, v24, v23, 0x7060302
+; GFX10-NEXT:    v_perm_b32 v14, v29, v14, 0x7060302
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_lshlrev_b32_e32 v17, 16, v32
-; GFX10-NEXT:    v_and_b32_e32 v18, 0xffff0000, v32
-; GFX10-NEXT:    v_max_f32_e32 v17, v31, v17
-; GFX10-NEXT:    v_max_f32_e32 v15, v15, v18
-; GFX10-NEXT:    v_bfe_u32 v10, v17, 16, 1
-; GFX10-NEXT:    v_bfe_u32 v11, v15, 16, 1
-; GFX10-NEXT:    v_or_b32_e32 v12, 0x400000, v17
+; GFX10-NEXT:    v_lshlrev_b32_e32 v8, 16, v16
+; GFX10-NEXT:    v_and_b32_e32 v16, 0xffff0000, v16
+; GFX10-NEXT:    v_max_f32_e32 v17, v33, v8
+; GFX10-NEXT:    v_max_f32_e32 v15, v15, v16
+; GFX10-NEXT:    v_perm_b32 v8, v32, v31, 0x7060302
+; GFX10-NEXT:    v_bfe_u32 v16, v17, 16, 1
+; GFX10-NEXT:    v_bfe_u32 v18, v15, 16, 1
+; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v17
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX10-NEXT:    v_or_b32_e32 v19, 0x400000, v15
-; GFX10-NEXT:    v_add3_u32 v18, v10, v17, 0x7fff
-; GFX10-NEXT:    v_add3_u32 v11, v11, v15, 0x7fff
-; GFX10-NEXT:    v_perm_b32 v10, v37, v13, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v13, v36, v25, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v18, v12, vcc_lo
+; GFX10-NEXT:    v_or_b32_e32 v20, 0x400000, v15
+; GFX10-NEXT:    v_add3_u32 v16, v16, v17, 0x7fff
+; GFX10-NEXT:    v_add3_u32 v18, v18, v15, 0x7fff
+; GFX10-NEXT:    v_cndmask_b32_e32 v16, v16, v19, vcc_lo
 ; GFX10-NEXT:    v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX10-NEXT:    v_perm_b32 v12, v33, v48, 0x7060302
-; GFX10-NEXT:    v_cndmask_b32_e32 v15, v11, v19, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v11, v24, v14, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v14, v23, v16, 0x7060302
-; GFX10-NEXT:    v_perm_b32 v15, v15, v17, 0x7060302
+; GFX10-NEXT:    v_cndmask_b32_e32 v15, v18, v20, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v15, v15, v16, 0x7060302
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11TRUE16-LABEL: v_maxnum_v32bf16:
@@ -41331,136 +41331,136 @@ define <16 x bfloat> @v_vselect_v16bf16(<16 x i1> %cond, <16 x bfloat> %a, <16 x
 ; GFX7-LABEL: v_vselect_v16bf16:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_and_b32_e32 v8, 1, v8
-; GFX7-NEXT:    v_and_b32_e32 v7, 1, v7
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[16:17], 1, v8
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[14:15], 1, v7
-; GFX7-NEXT:    buffer_load_dword v7, off, s[0:3], s32
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:64
-; GFX7-NEXT:    v_and_b32_e32 v15, 1, v15
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[12:13], 1, v15
-; GFX7-NEXT:    v_and_b32_e32 v14, 1, v14
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[10:11], 1, v14
-; GFX7-NEXT:    v_and_b32_e32 v13, 1, v13
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[8:9], 1, v13
-; GFX7-NEXT:    v_and_b32_e32 v12, 1, v12
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[6:7], 1, v12
-; GFX7-NEXT:    v_and_b32_e32 v11, 1, v11
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[4:5], 1, v11
-; GFX7-NEXT:    v_and_b32_e32 v10, 1, v10
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v10
-; GFX7-NEXT:    v_and_b32_e32 v6, 1, v6
-; GFX7-NEXT:    v_and_b32_e32 v5, 1, v5
-; GFX7-NEXT:    v_and_b32_e32 v9, 1, v9
-; GFX7-NEXT:    v_cmp_eq_u32_e64 s[18:19], 1, v9
-; GFX7-NEXT:    v_and_b32_e32 v4, 1, v4
-; GFX7-NEXT:    v_mul_f32_e32 v20, 1.0, v20
-; GFX7-NEXT:    v_and_b32_e32 v3, 1, v3
-; GFX7-NEXT:    v_mul_f32_e32 v19, 1.0, v19
-; GFX7-NEXT:    v_and_b32_e32 v2, 1, v2
-; GFX7-NEXT:    v_mul_f32_e32 v18, 1.0, v18
-; GFX7-NEXT:    v_and_b32_e32 v1, 1, v1
 ; GFX7-NEXT:    v_and_b32_e32 v0, 1, v0
-; GFX7-NEXT:    v_mul_f32_e32 v17, 1.0, v17
+; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v1
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[4:5], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v2
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[6:7], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v3
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[8:9], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v4
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[10:11], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v5
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[12:13], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v6
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[14:15], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v7
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[16:17], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v8
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[18:19], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v9
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[20:21], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v10
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[22:23], 1, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 1, v11
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[24:25], 1, v0
+; GFX7-NEXT:    buffer_load_dword v0, off, s[0:3], s32
+; GFX7-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:64
+; GFX7-NEXT:    v_and_b32_e32 v2, 1, v12
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[26:27], 1, v2
+; GFX7-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:60
+; GFX7-NEXT:    v_and_b32_e32 v3, 1, v13
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[28:29], 1, v3
+; GFX7-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:56
+; GFX7-NEXT:    v_and_b32_e32 v4, 1, v14
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[40:41], 1, v4
+; GFX7-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:52
+; GFX7-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:48
+; GFX7-NEXT:    v_and_b32_e32 v4, 1, v15
+; GFX7-NEXT:    v_cmp_eq_u32_e64 s[42:43], 1, v4
 ; GFX7-NEXT:    v_mul_f32_e32 v16, 1.0, v16
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v7
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v15, v8, v7, s[12:13]
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:60
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v30
-; GFX7-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v14, v8, v7, s[10:11]
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:56
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v29
-; GFX7-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v13, v8, v7, s[8:9]
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:52
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v28
-; GFX7-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v12, v8, v7, s[6:7]
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:48
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v27
-; GFX7-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v11, v8, v7, s[4:5]
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:44
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v26
+; GFX7-NEXT:    v_mul_f32_e32 v17, 1.0, v17
+; GFX7-NEXT:    v_mul_f32_e32 v18, 1.0, v18
+; GFX7-NEXT:    v_mul_f32_e32 v19, 1.0, v19
+; GFX7-NEXT:    v_mul_f32_e32 v20, 1.0, v20
+; GFX7-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:40
+; GFX7-NEXT:    s_waitcnt vmcnt(6)
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT:    s_waitcnt vmcnt(5)
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GFX7-NEXT:    v_cndmask_b32_e64 v15, v1, v0, s[42:43]
+; GFX7-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:44
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v30
+; GFX7-NEXT:    s_waitcnt vmcnt(5)
+; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GFX7-NEXT:    v_cndmask_b32_e64 v14, v2, v1, s[40:41]
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v29
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v3
+; GFX7-NEXT:    v_cndmask_b32_e64 v13, v2, v1, s[28:29]
+; GFX7-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:36
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v28
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v5
+; GFX7-NEXT:    v_cndmask_b32_e64 v12, v2, v1, s[26:27]
+; GFX7-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:32
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v27
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v6
+; GFX7-NEXT:    v_cndmask_b32_e64 v11, v5, v1, s[24:25]
+; GFX7-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:28
+; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v26
 ; GFX7-NEXT:    v_and_b32_e32 v11, 0xffff0000, v11
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e32 v10, v8, v7, vcc
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v6
-; GFX7-NEXT:    v_mul_f32_e32 v6, 1.0, v22
-; GFX7-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:28
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:40
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v25
-; GFX7-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v22, 1.0, v22
-; GFX7-NEXT:    v_cndmask_b32_e32 v6, v22, v6, vcc
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v5
+; GFX7-NEXT:    v_and_b32_e32 v12, 0xffff0000, v12
+; GFX7-NEXT:    v_and_b32_e32 v13, 0xffff0000, v13
+; GFX7-NEXT:    v_and_b32_e32 v14, 0xffff0000, v14
+; GFX7-NEXT:    v_and_b32_e32 v15, 0xffff0000, v15
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v4, 1.0, v4
+; GFX7-NEXT:    s_waitcnt vmcnt(3)
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT:    v_cndmask_b32_e64 v10, v0, v5, s[22:23]
+; GFX7-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24
+; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v25
+; GFX7-NEXT:    v_cndmask_b32_e64 v9, v4, v5, s[20:21]
+; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v24
+; GFX7-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:4
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v3, 1.0, v3
+; GFX7-NEXT:    v_cndmask_b32_e64 v8, v3, v5, s[18:19]
+; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v23
+; GFX7-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:8
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v2
+; GFX7-NEXT:    v_cndmask_b32_e64 v7, v2, v5, s[16:17]
+; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v22
+; GFX7-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:12
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, v1, v5, s[14:15]
 ; GFX7-NEXT:    v_mul_f32_e32 v5, 1.0, v21
-; GFX7-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:24
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v9, v8, v7, s[18:19]
-; GFX7-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:36
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v24
+; GFX7-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:16
 ; GFX7-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
+; GFX7-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
+; GFX7-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
 ; GFX7-NEXT:    v_and_b32_e32 v9, 0xffff0000, v9
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v21, 1.0, v21
-; GFX7-NEXT:    v_cndmask_b32_e32 v5, v21, v5, vcc
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v4
-; GFX7-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:20
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v8, 1.0, v8
-; GFX7-NEXT:    v_cndmask_b32_e64 v8, v8, v7, s[16:17]
-; GFX7-NEXT:    v_mul_f32_e32 v7, 1.0, v23
-; GFX7-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:32
+; GFX7-NEXT:    v_and_b32_e32 v10, 0xffff0000, v10
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT:    v_cndmask_b32_e64 v5, v0, v5, s[12:13]
+; GFX7-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:20
 ; GFX7-NEXT:    v_and_b32_e32 v5, 0xffff0000, v5
-; GFX7-NEXT:    v_and_b32_e32 v8, 0xffff0000, v8
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
+; GFX7-NEXT:    s_waitcnt vmcnt(4)
 ; GFX7-NEXT:    v_mul_f32_e32 v4, 1.0, v4
-; GFX7-NEXT:    v_cndmask_b32_e32 v4, v4, v20, vcc
-; GFX7-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:16
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
-; GFX7-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:12
-; GFX7-NEXT:    s_waitcnt vmcnt(2)
-; GFX7-NEXT:    v_mul_f32_e32 v23, 1.0, v23
-; GFX7-NEXT:    v_cndmask_b32_e64 v7, v23, v7, s[14:15]
-; GFX7-NEXT:    v_and_b32_e32 v4, 0xffff0000, v4
-; GFX7-NEXT:    v_and_b32_e32 v7, 0xffff0000, v7
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v20, 1.0, v20
-; GFX7-NEXT:    v_cndmask_b32_e32 v19, v20, v19, vcc
-; GFX7-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:4
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
-; GFX7-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:8
-; GFX7-NEXT:    s_waitcnt vmcnt(2)
+; GFX7-NEXT:    s_waitcnt vmcnt(3)
 ; GFX7-NEXT:    v_mul_f32_e32 v3, 1.0, v3
-; GFX7-NEXT:    v_cndmask_b32_e32 v3, v3, v18, vcc
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX7-NEXT:    s_waitcnt vmcnt(1)
-; GFX7-NEXT:    v_mul_f32_e32 v18, 1.0, v20
-; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    s_waitcnt vmcnt(2)
 ; GFX7-NEXT:    v_mul_f32_e32 v2, 1.0, v2
-; GFX7-NEXT:    v_cndmask_b32_e32 v1, v2, v17, vcc
-; GFX7-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX7-NEXT:    v_cndmask_b32_e32 v0, v18, v16, vcc
-; GFX7-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
+; GFX7-NEXT:    v_cndmask_b32_e64 v2, v2, v18, s[6:7]
+; GFX7-NEXT:    v_and_b32_e32 v2, 0xffff0000, v2
+; GFX7-NEXT:    s_waitcnt vmcnt(1)
+; GFX7-NEXT:    v_mul_f32_e32 v1, 1.0, v1
+; GFX7-NEXT:    v_cndmask_b32_e64 v19, v1, v19, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v1, v3, v17, s[4:5]
 ; GFX7-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
-; GFX7-NEXT:    v_and_b32_e32 v2, 0xffff0000, v3
 ; GFX7-NEXT:    v_and_b32_e32 v3, 0xffff0000, v19
+; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT:    v_cndmask_b32_e64 v20, v0, v20, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e32 v0, v4, v16, vcc
+; GFX7-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
+; GFX7-NEXT:    v_and_b32_e32 v4, 0xffff0000, v20
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_vselect_v16bf16:
@@ -41513,19 +41513,19 @@ define <16 x bfloat> @v_vselect_v16bf16(<16 x i1> %cond, <16 x bfloat> %a, <16 x
 ; GFX8-NEXT:    v_cndmask_b32_e64 v7, v30, v22, s[26:27]
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
 ; GFX8-NEXT:    v_cndmask_b32_e64 v8, v29, v21, s[22:23]
-; GFX8-NEXT:    v_cndmask_b32_e64 v9, v28, v20, s[18:19]
+; GFX8-NEXT:    v_cndmask_b32_e64 v11, v28, v20, s[18:19]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v12, v27, v19, s[14:15]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v13, v26, v18, s[10:11]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v14, v25, v17, s[6:7]
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; GFX8-NEXT:    v_or_b32_sdwa v6, v7, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v4, v9, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v4, v11, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v5, v8, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_cndmask_b32_e64 v10, v0, v23, s[40:41]
+; GFX8-NEXT:    v_cndmask_b32_e64 v9, v0, v23, s[40:41]
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v0, 16, v0
-; GFX8-NEXT:    v_cndmask_b32_e64 v11, v0, v1, s[42:43]
+; GFX8-NEXT:    v_cndmask_b32_e64 v10, v0, v1, s[42:43]
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v0, 16, v19
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v1, 16, v27
 ; GFX8-NEXT:    v_cndmask_b32_e64 v3, v1, v0, s[16:17]
@@ -41542,153 +41542,153 @@ define <16 x bfloat> @v_vselect_v16bf16(<16 x i1> %cond, <16 x bfloat> %a, <16 x
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GFX8-NEXT:    v_lshlrev_b32_e32 v7, 16, v11
+; GFX8-NEXT:    v_lshlrev_b32_e32 v7, 16, v10
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v15, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v1, v14, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v2, v13, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v3, v12, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v7, v10, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v7, v9, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: v_vselect_v16bf16:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_and_b32_e32 v12, 1, v12
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v12
-; GFX9-NEXT:    v_and_b32_e32 v13, 1, v13
-; GFX9-NEXT:    v_cndmask_b32_e32 v12, v30, v22, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
-; GFX9-NEXT:    v_lshrrev_b32_e32 v30, 16, v30
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v13
-; GFX9-NEXT:    v_and_b32_e32 v10, 1, v10
-; GFX9-NEXT:    v_cndmask_b32_e32 v13, v30, v22, vcc
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v10
-; GFX9-NEXT:    v_and_b32_e32 v10, 1, v11
-; GFX9-NEXT:    v_cndmask_b32_e32 v11, v29, v21, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
-; GFX9-NEXT:    v_lshrrev_b32_e32 v22, 16, v29
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v10
-; GFX9-NEXT:    v_cndmask_b32_e32 v10, v22, v21, vcc
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32
-; GFX9-NEXT:    v_and_b32_e32 v8, 1, v8
-; GFX9-NEXT:    v_and_b32_e32 v9, 1, v9
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v8
-; GFX9-NEXT:    v_lshrrev_b32_e32 v8, 16, v20
-; GFX9-NEXT:    v_cndmask_b32_e32 v20, v28, v20, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v22, 16, v28
 ; GFX9-NEXT:    v_and_b32_e32 v6, 1, v6
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v9
-; GFX9-NEXT:    v_and_b32_e32 v7, 1, v7
-; GFX9-NEXT:    v_cndmask_b32_e32 v8, v22, v8, vcc
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v6
-; GFX9-NEXT:    v_lshrrev_b32_e32 v9, 16, v19
-; GFX9-NEXT:    v_lshrrev_b32_e32 v22, 16, v27
+; GFX9-NEXT:    v_and_b32_e32 v6, 1, v8
+; GFX9-NEXT:    v_cmp_eq_u32_e64 s[4:5], 1, v6
+; GFX9-NEXT:    v_and_b32_e32 v6, 1, v10
+; GFX9-NEXT:    v_cmp_eq_u32_e64 s[6:7], 1, v6
+; GFX9-NEXT:    v_and_b32_e32 v6, 1, v12
+; GFX9-NEXT:    v_cmp_eq_u32_e64 s[8:9], 1, v6
+; GFX9-NEXT:    v_and_b32_e32 v8, 1, v13
+; GFX9-NEXT:    v_cndmask_b32_e64 v6, v30, v22, s[8:9]
+; GFX9-NEXT:    v_cmp_eq_u32_e64 s[8:9], 1, v8
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32
+; GFX9-NEXT:    v_lshrrev_b32_e32 v10, 16, v22
+; GFX9-NEXT:    v_lshrrev_b32_e32 v12, 16, v30
+; GFX9-NEXT:    v_and_b32_e32 v11, 1, v11
+; GFX9-NEXT:    v_and_b32_e32 v9, 1, v9
+; GFX9-NEXT:    v_and_b32_e32 v7, 1, v7
 ; GFX9-NEXT:    v_and_b32_e32 v4, 1, v4
+; GFX9-NEXT:    v_and_b32_e32 v13, 1, v14
+; GFX9-NEXT:    v_cndmask_b32_e64 v10, v12, v10, s[8:9]
+; GFX9-NEXT:    v_lshrrev_b32_e32 v12, 16, v21
+; GFX9-NEXT:    v_cndmask_b32_e64 v14, v29, v21, s[6:7]
+; GFX9-NEXT:    v_lshrrev_b32_e32 v21, 16, v29
+; GFX9-NEXT:    v_cmp_eq_u32_e64 s[6:7], 1, v11
+; GFX9-NEXT:    v_lshrrev_b32_e32 v11, 16, v20
+; GFX9-NEXT:    v_cndmask_b32_e64 v20, v28, v20, s[4:5]
+; GFX9-NEXT:    v_lshrrev_b32_e32 v22, 16, v19
+; GFX9-NEXT:    v_cmp_eq_u32_e64 s[4:5], 1, v9
+; GFX9-NEXT:    v_lshrrev_b32_e32 v9, 16, v27
 ; GFX9-NEXT:    v_cndmask_b32_e32 v19, v27, v19, vcc
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v7
+; GFX9-NEXT:    v_cndmask_b32_e64 v12, v21, v12, s[6:7]
+; GFX9-NEXT:    v_lshrrev_b32_e32 v21, 16, v28
 ; GFX9-NEXT:    v_and_b32_e32 v5, 1, v5
-; GFX9-NEXT:    v_cndmask_b32_e32 v9, v22, v9, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v9, v9, v22, vcc
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v4
-; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 16, v18
+; GFX9-NEXT:    v_cndmask_b32_e64 v11, v21, v11, s[4:5]
+; GFX9-NEXT:    v_lshrrev_b32_e32 v21, 16, v18
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v27, 16, v26
-; GFX9-NEXT:    v_and_b32_e32 v14, 1, v14
 ; GFX9-NEXT:    v_cndmask_b32_e32 v4, v26, v18, vcc
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v5
 ; GFX9-NEXT:    v_and_b32_e32 v15, 1, v15
-; GFX9-NEXT:    v_cndmask_b32_e32 v5, v27, v6, vcc
-; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v14
-; GFX9-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT:    v_cndmask_b32_e32 v5, v27, v21, vcc
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v13
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v7, 16, v23
+; GFX9-NEXT:    v_and_b32_e32 v2, 1, v2
 ; GFX9-NEXT:    v_and_b32_e32 v3, 1, v3
 ; GFX9-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX9-NEXT:    v_and_b32_e32 v1, 1, v1
 ; GFX9-NEXT:    s_mov_b32 s4, 0x5040100
+; GFX9-NEXT:    v_perm_b32 v6, v10, v6, s4
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_cndmask_b32_e32 v14, v21, v23, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 16, v21
+; GFX9-NEXT:    v_cndmask_b32_e32 v13, v8, v23, vcc
+; GFX9-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v15
-; GFX9-NEXT:    v_cndmask_b32_e32 v7, v6, v7, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v7, v8, v7, vcc
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
 ; GFX9-NEXT:    v_cndmask_b32_e32 v2, v25, v17, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 16, v17
+; GFX9-NEXT:    v_lshrrev_b32_e32 v8, 16, v17
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v15, 16, v25
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v15, v6, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v3, v15, v8, vcc
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v24, v16, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 16, v16
+; GFX9-NEXT:    v_lshrrev_b32_e32 v8, 16, v16
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v15, 16, v24
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v15, v6, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v15, v8, vcc
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s4
 ; GFX9-NEXT:    v_perm_b32 v1, v3, v2, s4
 ; GFX9-NEXT:    v_perm_b32 v2, v5, v4, s4
 ; GFX9-NEXT:    v_perm_b32 v3, v9, v19, s4
-; GFX9-NEXT:    v_perm_b32 v4, v8, v20, s4
-; GFX9-NEXT:    v_perm_b32 v5, v10, v11, s4
-; GFX9-NEXT:    v_perm_b32 v6, v13, v12, s4
-; GFX9-NEXT:    v_perm_b32 v7, v7, v14, s4
+; GFX9-NEXT:    v_perm_b32 v4, v11, v20, s4
+; GFX9-NEXT:    v_perm_b32 v5, v12, v14, s4
+; GFX9-NEXT:    v_perm_b32 v7, v7, v13, s4
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_vselect_v16bf16:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX10-NEXT:    v_and_b32_e32 v12, 1, v12
 ; GFX10-NEXT:    v_and_b32_e32 v13, 1, v13
 ; GFX10-NEXT:    v_and_b32_e32 v10, 1, v10
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v33, 16, v22
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v34, 16, v30
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v12
 ; GFX10-NEXT:    v_and_b32_e32 v11, 1, v11
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v13
 ; GFX10-NEXT:    v_and_b32_e32 v8, 1, v8
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v35, 16, v21
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v29
-; GFX10-NEXT:    v_cndmask_b32_e32 v22, v30, v22, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v13
 ; GFX10-NEXT:    v_and_b32_e32 v9, 1, v9
+; GFX10-NEXT:    v_cndmask_b32_e32 v33, v34, v33, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v10
 ; GFX10-NEXT:    v_and_b32_e32 v6, 1, v6
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v20
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v38, 16, v28
-; GFX10-NEXT:    v_cndmask_b32_e32 v33, v34, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v10
 ; GFX10-NEXT:    v_and_b32_e32 v4, 1, v4
+; GFX10-NEXT:    v_cndmask_b32_e32 v10, v29, v21, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v11
 ; GFX10-NEXT:    v_and_b32_e32 v2, 1, v2
 ; GFX10-NEXT:    v_and_b32_e32 v3, 1, v3
 ; GFX10-NEXT:    v_and_b32_e32 v0, 1, v0
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v29, v21, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v11
+; GFX10-NEXT:    v_and_b32_e32 v12, 1, v12
+; GFX10-NEXT:    v_cndmask_b32_e32 v11, v36, v35, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v8
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v51, 16, v17
-; GFX10-NEXT:    v_lshrrev_b32_e32 v12, 16, v25
+; GFX10-NEXT:    v_lshrrev_b32_e32 v13, 16, v25
 ; GFX10-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 1, v12
+; GFX10-NEXT:    v_cndmask_b32_e32 v8, v28, v20, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v9
 ; GFX10-NEXT:    v_and_b32_e32 v5, 1, v5
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, v36, v35, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v8
-; GFX10-NEXT:    v_lshrrev_b32_e32 v30, 16, v16
-; GFX10-NEXT:    v_lshrrev_b32_e32 v13, 16, v24
+; GFX10-NEXT:    v_lshrrev_b32_e32 v12, 16, v16
+; GFX10-NEXT:    v_cndmask_b32_e64 v22, v30, v22, s4
+; GFX10-NEXT:    v_lshrrev_b32_e32 v30, 16, v24
+; GFX10-NEXT:    v_cndmask_b32_e32 v9, v38, v37, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v6
 ; GFX10-NEXT:    v_and_b32_e32 v7, 1, v7
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v49, 16, v18
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v28, v20, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v9
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v50, 16, v26
 ; GFX10-NEXT:    v_and_b32_e32 v14, 1, v14
+; GFX10-NEXT:    v_cndmask_b32_e32 v6, v27, v19, vcc_lo
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v4
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v39, 16, v19
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v48, 16, v27
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v38, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v6
 ; GFX10-NEXT:    v_and_b32_e32 v15, 1, v15
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v32, 16, v23
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v27, v19, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v4
 ; GFX10-NEXT:    v_cndmask_b32_e32 v4, v26, v18, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v2
 ; GFX10-NEXT:    v_cndmask_b32_e32 v2, v25, v17, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v12, v51, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v3, v13, v51, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v0
 ; GFX10-NEXT:    v_cndmask_b32_e32 v0, v24, v16, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v13, v30, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e32 v1, v30, v12, vcc_lo
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v5
 ; GFX10-NEXT:    v_perm_b32 v0, v1, v0, 0x5040100
 ; GFX10-NEXT:    v_cndmask_b32_e32 v5, v50, v49, vcc_lo
@@ -42608,35 +42608,35 @@ define <32 x bfloat> @v_vselect_v32bf16(<32 x i1> %cond, <32 x bfloat> %a, <32 x
 ; GFX8-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:44
 ; GFX8-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:112
 ; GFX8-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:48
-; GFX8-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:116
-; GFX8-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:52
-; GFX8-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
-; GFX8-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:56
-; GFX8-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:124
-; GFX8-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:60
-; GFX8-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:128
-; GFX8-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:64
+; GFX8-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:116
+; GFX8-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:52
+; GFX8-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:120
+; GFX8-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:56
+; GFX8-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:124
+; GFX8-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:60
+; GFX8-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:128
+; GFX8-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:64
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v29
+; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v25
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_lshrrev_b32_e32 v28, 16, v32
-; GFX8-NEXT:    v_cndmask_b32_e64 v28, v33, v28, s[38:39]
-; GFX8-NEXT:    v_cndmask_b32_e64 v29, v29, v32, s[36:37]
-; GFX8-NEXT:    v_lshrrev_b32_e32 v32, 16, v31
-; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v30
-; GFX8-NEXT:    v_cndmask_b32_e64 v32, v33, v32, s[34:35]
-; GFX8-NEXT:    v_cndmask_b32_e64 v30, v30, v31, s[30:31]
-; GFX8-NEXT:    v_lshrrev_b32_e32 v31, 16, v27
-; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v26
-; GFX8-NEXT:    v_cndmask_b32_e64 v31, v33, v31, s[90:91]
-; GFX8-NEXT:    v_cndmask_b32_e64 v26, v26, v27, s[88:89]
-; GFX8-NEXT:    v_lshrrev_b32_e32 v27, 16, v25
-; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v24
-; GFX8-NEXT:    v_cndmask_b32_e64 v27, v33, v27, s[78:79]
-; GFX8-NEXT:    v_cndmask_b32_e64 v24, v24, v25, s[76:77]
-; GFX8-NEXT:    v_lshrrev_b32_e32 v25, 16, v23
+; GFX8-NEXT:    v_lshrrev_b32_e32 v24, 16, v26
+; GFX8-NEXT:    v_cndmask_b32_e64 v24, v33, v24, s[38:39]
+; GFX8-NEXT:    v_cndmask_b32_e64 v25, v25, v26, s[36:37]
+; GFX8-NEXT:    v_lshrrev_b32_e32 v26, 16, v28
+; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v27
+; GFX8-NEXT:    v_cndmask_b32_e64 v26, v33, v26, s[34:35]
+; GFX8-NEXT:    v_cndmask_b32_e64 v27, v27, v28, s[30:31]
+; GFX8-NEXT:    v_lshrrev_b32_e32 v28, 16, v30
+; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v29
+; GFX8-NEXT:    v_cndmask_b32_e64 v28, v33, v28, s[90:91]
+; GFX8-NEXT:    v_cndmask_b32_e64 v29, v29, v30, s[88:89]
+; GFX8-NEXT:    v_lshrrev_b32_e32 v30, 16, v32
+; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v31
+; GFX8-NEXT:    v_cndmask_b32_e64 v30, v33, v30, s[78:79]
+; GFX8-NEXT:    v_cndmask_b32_e64 v31, v31, v32, s[76:77]
+; GFX8-NEXT:    v_lshrrev_b32_e32 v32, 16, v23
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v22
-; GFX8-NEXT:    v_cndmask_b32_e64 v25, v33, v25, s[74:75]
+; GFX8-NEXT:    v_cndmask_b32_e64 v32, v33, v32, s[74:75]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v22, v22, v23, s[72:73]
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v23, 16, v21
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v33, 16, v20
@@ -42701,19 +42701,19 @@ define <32 x bfloat> @v_vselect_v32bf16(<32 x i1> %cond, <32 x bfloat> %a, <32 x
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v8, 16, v19
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v9, 16, v21
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v10, 16, v23
-; GFX8-NEXT:    v_lshlrev_b32_e32 v11, 16, v25
-; GFX8-NEXT:    v_lshlrev_b32_e32 v12, 16, v27
-; GFX8-NEXT:    v_lshlrev_b32_e32 v13, 16, v31
-; GFX8-NEXT:    v_lshlrev_b32_e32 v14, 16, v32
-; GFX8-NEXT:    v_lshlrev_b32_e32 v15, 16, v28
+; GFX8-NEXT:    v_lshlrev_b32_e32 v11, 16, v32
+; GFX8-NEXT:    v_lshlrev_b32_e32 v12, 16, v30
+; GFX8-NEXT:    v_lshlrev_b32_e32 v13, 16, v28
+; GFX8-NEXT:    v_lshlrev_b32_e32 v14, 16, v26
+; GFX8-NEXT:    v_lshlrev_b32_e32 v15, 16, v24
 ; GFX8-NEXT:    v_or_b32_sdwa v8, v16, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v9, v18, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v10, v20, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v11, v22, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v12, v24, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v13, v26, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v14, v30, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v15, v29, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v12, v31, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v13, v29, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v14, v27, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v15, v25, v15 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_readlane_b32 s39, v34, 7
 ; GFX8-NEXT:    v_readlane_b32 s38, v34, 6
 ; GFX8-NEXT:    v_readlane_b32 s37, v34, 5
@@ -42833,19 +42833,19 @@ define <32 x bfloat> @v_vselect_v32bf16(<32 x i1> %cond, <32 x bfloat> %a, <32 x
 ; GFX9-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
 ; GFX9-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:56
 ; GFX9-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:124
-; GFX9-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:60
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:128
 ; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:64
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_cndmask_b32_e64 v29, v31, v32, s[34:35]
+; GFX9-NEXT:    v_cndmask_b32_e64 v30, v31, v32, s[34:35]
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v32, 16, v32
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v31, 16, v31
 ; GFX9-NEXT:    v_cndmask_b32_e64 v31, v31, v32, s[30:31]
-; GFX9-NEXT:    v_cndmask_b32_e64 v32, v28, v30, s[94:95]
-; GFX9-NEXT:    v_lshrrev_b32_e32 v30, 16, v30
+; GFX9-NEXT:    v_cndmask_b32_e64 v32, v28, v29, s[94:95]
+; GFX9-NEXT:    v_lshrrev_b32_e32 v29, 16, v29
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v28, 16, v28
-; GFX9-NEXT:    v_cndmask_b32_e64 v28, v28, v30, s[92:93]
-; GFX9-NEXT:    v_cndmask_b32_e64 v30, v26, v27, s[90:91]
+; GFX9-NEXT:    v_cndmask_b32_e64 v28, v28, v29, s[92:93]
+; GFX9-NEXT:    v_cndmask_b32_e64 v29, v26, v27, s[90:91]
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v27, 16, v27
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v26, 16, v26
 ; GFX9-NEXT:    v_cndmask_b32_e64 v26, v26, v27, s[88:89]
@@ -42915,9 +42915,9 @@ define <32 x bfloat> @v_vselect_v32bf16(<32 x i1> %cond, <32 x bfloat> %a, <32 x
 ; GFX9-NEXT:    v_perm_b32 v10, v20, v23, s4
 ; GFX9-NEXT:    v_perm_b32 v11, v22, v25, s4
 ; GFX9-NEXT:    v_perm_b32 v12, v24, v27, s4
-; GFX9-NEXT:    v_perm_b32 v13, v26, v30, s4
+; GFX9-NEXT:    v_perm_b32 v13, v26, v29, s4
 ; GFX9-NEXT:    v_perm_b32 v14, v28, v32, s4
-; GFX9-NEXT:    v_perm_b32 v15, v31, v29, s4
+; GFX9-NEXT:    v_perm_b32 v15, v31, v30, s4
 ; GFX9-NEXT:    v_readlane_b32 s35, v33, 3
 ; GFX9-NEXT:    v_readlane_b32 s34, v33, 2
 ; GFX9-NEXT:    v_readlane_b32 s31, v33, 1
@@ -42931,206 +42931,186 @@ define <32 x bfloat> @v_vselect_v32bf16(<32 x i1> %cond, <32 x bfloat> %a, <32 x
 ; GFX10-LABEL: v_vselect_v32bf16:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    s_clause 0xa
-; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:28
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:92
-; GFX10-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:40
-; GFX10-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:104
-; GFX10-NEXT:    buffer_load_ushort v35, off, s[0:3], s32
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:128
-; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:64
-; GFX10-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:96
-; GFX10-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:108
-; GFX10-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:44
-; GFX10-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:112
-; GFX10-NEXT:    v_and_b32_e32 v30, 1, v30
-; GFX10-NEXT:    v_and_b32_e32 v18, 1, v18
-; GFX10-NEXT:    v_and_b32_e32 v12, 1, v12
-; GFX10-NEXT:    v_and_b32_e32 v13, 1, v13
-; GFX10-NEXT:    v_and_b32_e32 v19, 1, v19
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v30
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 1, v18
-; GFX10-NEXT:    v_and_b32_e32 v28, 1, v28
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 1, v13
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 1, v19
-; GFX10-NEXT:    v_and_b32_e32 v26, 1, v26
-; GFX10-NEXT:    v_and_b32_e32 v24, 1, v24
-; GFX10-NEXT:    v_and_b32_e32 v22, 1, v22
-; GFX10-NEXT:    v_and_b32_e32 v20, 1, v20
-; GFX10-NEXT:    v_and_b32_e32 v21, 1, v21
-; GFX10-NEXT:    v_and_b32_e32 v16, 1, v16
-; GFX10-NEXT:    v_and_b32_e32 v14, 1, v14
-; GFX10-NEXT:    v_and_b32_e32 v17, 1, v17
-; GFX10-NEXT:    v_and_b32_e32 v15, 1, v15
-; GFX10-NEXT:    v_and_b32_e32 v10, 1, v10
-; GFX10-NEXT:    v_and_b32_e32 v8, 1, v8
-; GFX10-NEXT:    v_and_b32_e32 v6, 1, v6
-; GFX10-NEXT:    v_and_b32_e32 v4, 1, v4
-; GFX10-NEXT:    v_and_b32_e32 v2, 1, v2
 ; GFX10-NEXT:    v_and_b32_e32 v0, 1, v0
-; GFX10-NEXT:    v_and_b32_e32 v11, 1, v11
-; GFX10-NEXT:    v_and_b32_e32 v7, 1, v7
-; GFX10-NEXT:    v_and_b32_e32 v3, 1, v3
 ; GFX10-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX10-NEXT:    v_and_b32_e32 v5, 1, v5
-; GFX10-NEXT:    v_and_b32_e32 v9, 1, v9
-; GFX10-NEXT:    s_waitcnt vmcnt(10)
-; GFX10-NEXT:    v_lshrrev_b32_e32 v30, 16, v31
-; GFX10-NEXT:    s_waitcnt vmcnt(9)
-; GFX10-NEXT:    v_lshrrev_b32_e32 v50, 16, v32
-; GFX10-NEXT:    s_waitcnt vmcnt(8)
-; GFX10-NEXT:    v_lshrrev_b32_e32 v13, 16, v33
-; GFX10-NEXT:    s_waitcnt vmcnt(7)
-; GFX10-NEXT:    v_cndmask_b32_e64 v18, v34, v33, s6
-; GFX10-NEXT:    s_waitcnt vmcnt(6)
-; GFX10-NEXT:    v_and_b32_e32 v35, 1, v35
-; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 1, v12
-; GFX10-NEXT:    s_waitcnt vmcnt(4)
-; GFX10-NEXT:    v_cndmask_b32_e32 v54, v36, v37, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v35
-; GFX10-NEXT:    v_lshrrev_b32_e32 v51, 16, v34
-; GFX10-NEXT:    v_cndmask_b32_e64 v12, v32, v31, s6
-; GFX10-NEXT:    s_clause 0x6
-; GFX10-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:68
-; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:72
-; GFX10-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
-; GFX10-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:76
-; GFX10-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:12
-; GFX10-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:80
-; GFX10-NEXT:    v_cndmask_b32_e64 v30, v50, v30, s4
-; GFX10-NEXT:    v_cndmask_b32_e32 v35, v36, v37, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:124
-; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:60
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v28
-; GFX10-NEXT:    v_and_b32_e32 v28, 1, v29
-; GFX10-NEXT:    v_cndmask_b32_e64 v13, v51, v13, s5
-; GFX10-NEXT:    s_waitcnt vmcnt(3)
-; GFX10-NEXT:    v_lshrrev_b32_e32 v50, 16, v52
-; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_cndmask_b32_e32 v29, v36, v37, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v28
-; GFX10-NEXT:    v_cndmask_b32_e32 v28, v36, v37, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:120
-; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:56
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v26
-; GFX10-NEXT:    v_and_b32_e32 v26, 1, v27
-; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_cndmask_b32_e32 v27, v36, v37, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v26
-; GFX10-NEXT:    v_cndmask_b32_e32 v26, v36, v37, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:116
-; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:52
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v24
-; GFX10-NEXT:    v_and_b32_e32 v24, 1, v25
-; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_cndmask_b32_e32 v25, v36, v37, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v24
-; GFX10-NEXT:    v_cndmask_b32_e32 v24, v36, v37, vcc_lo
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:48
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v22
-; GFX10-NEXT:    v_and_b32_e32 v22, 1, v23
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v49
-; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_cndmask_b32_e32 v23, v49, v36, vcc_lo
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v22
-; GFX10-NEXT:    v_lshrrev_b32_e32 v49, 16, v53
-; GFX10-NEXT:    v_cndmask_b32_e32 v22, v37, v36, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v20
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v48
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v39
-; GFX10-NEXT:    v_cndmask_b32_e32 v20, v39, v48, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v21
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:32
-; GFX10-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:16
-; GFX10-NEXT:    v_cndmask_b32_e32 v21, v37, v36, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:100
-; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:36
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v16
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s4, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v3
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s5, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v2
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s6, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v5
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s7, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v4
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s8, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v7
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s9, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v6
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s10, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v9
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s11, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v8
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s12, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v11
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s13, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v10
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s14, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v13
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s15, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v12
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s16, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v15
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s17, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v14
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s18, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v17
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s19, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v16
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s20, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v19
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s21, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v18
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s22, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v21
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s23, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v20
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s24, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v23
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s25, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v22
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s26, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v25
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s27, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v24
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s28, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v27
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s29, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v26
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s40, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v29
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s41, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v28
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s42, 1, v0
+; GFX10-NEXT:    buffer_load_ushort v0, off, s[0:3], s32
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_cndmask_b32_e32 v16, v36, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v14
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
-; GFX10-NEXT:    v_cndmask_b32_e32 v14, v38, v39, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v17
-; GFX10-NEXT:    v_lshrrev_b32_e32 v39, 16, v39
-; GFX10-NEXT:    v_lshrrev_b32_e32 v38, 16, v38
-; GFX10-NEXT:    v_cndmask_b32_e32 v17, v36, v37, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:88
-; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:24
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v15
-; GFX10-NEXT:    v_cndmask_b32_e32 v15, v38, v39, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:84
-; GFX10-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:20
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v10
-; GFX10-NEXT:    s_waitcnt vmcnt(2)
-; GFX10-NEXT:    v_cndmask_b32_e32 v10, v36, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v8
-; GFX10-NEXT:    v_lshrrev_b32_e32 v37, 16, v37
-; GFX10-NEXT:    v_lshrrev_b32_e32 v36, 16, v36
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v0
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s43, 1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 1, v30
+; GFX10-NEXT:    v_cmp_eq_u32_e64 s44, 1, v0
+; GFX10-NEXT:    s_clause 0x1f
+; GFX10-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:68
+; GFX10-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
+; GFX10-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:72
+; GFX10-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:8
+; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:76
+; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:12
+; GFX10-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:80
+; GFX10-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:16
+; GFX10-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:84
+; GFX10-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:20
+; GFX10-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:88
+; GFX10-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:24
+; GFX10-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:92
+; GFX10-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:28
+; GFX10-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:96
+; GFX10-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:32
+; GFX10-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:100
+; GFX10-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:36
+; GFX10-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:104
+; GFX10-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:40
+; GFX10-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:108
+; GFX10-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:44
+; GFX10-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:112
+; GFX10-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:48
+; GFX10-NEXT:    buffer_load_dword v24, off, s[0:3], s32 offset:116
+; GFX10-NEXT:    buffer_load_dword v25, off, s[0:3], s32 offset:52
+; GFX10-NEXT:    buffer_load_dword v26, off, s[0:3], s32 offset:120
+; GFX10-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:56
+; GFX10-NEXT:    buffer_load_dword v28, off, s[0:3], s32 offset:124
+; GFX10-NEXT:    buffer_load_dword v29, off, s[0:3], s32 offset:60
+; GFX10-NEXT:    buffer_load_dword v30, off, s[0:3], s32 offset:128
+; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:64
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_cndmask_b32_e32 v8, v38, v39, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v6
-; GFX10-NEXT:    v_lshrrev_b32_e32 v39, 16, v39
-; GFX10-NEXT:    v_lshrrev_b32_e32 v38, 16, v38
-; GFX10-NEXT:    v_cndmask_b32_e32 v6, v53, v48, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v4
-; GFX10-NEXT:    v_lshrrev_b32_e32 v48, 16, v48
-; GFX10-NEXT:    v_cndmask_b32_e32 v4, v34, v52, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v2
-; GFX10-NEXT:    v_lshrrev_b32_e32 v34, 16, v34
-; GFX10-NEXT:    v_cndmask_b32_e32 v2, v32, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v0
-; GFX10-NEXT:    v_lshrrev_b32_e32 v33, 16, v33
-; GFX10-NEXT:    v_lshrrev_b32_e32 v32, 16, v32
-; GFX10-NEXT:    v_cndmask_b32_e32 v0, v19, v31, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v11
+; GFX10-NEXT:    v_cndmask_b32_e64 v32, v30, v31, s44
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v31, 16, v31
+; GFX10-NEXT:    v_lshrrev_b32_e32 v30, 16, v30
+; GFX10-NEXT:    v_cndmask_b32_e64 v30, v30, v31, s43
+; GFX10-NEXT:    v_cndmask_b32_e64 v31, v28, v29, s42
+; GFX10-NEXT:    v_lshrrev_b32_e32 v29, 16, v29
+; GFX10-NEXT:    v_lshrrev_b32_e32 v28, 16, v28
+; GFX10-NEXT:    v_cndmask_b32_e64 v28, v28, v29, s41
+; GFX10-NEXT:    v_cndmask_b32_e64 v29, v26, v27, s40
+; GFX10-NEXT:    v_lshrrev_b32_e32 v27, 16, v27
+; GFX10-NEXT:    v_lshrrev_b32_e32 v26, 16, v26
+; GFX10-NEXT:    v_cndmask_b32_e64 v26, v26, v27, s29
+; GFX10-NEXT:    v_cndmask_b32_e64 v27, v24, v25, s28
+; GFX10-NEXT:    v_lshrrev_b32_e32 v25, 16, v25
+; GFX10-NEXT:    v_lshrrev_b32_e32 v24, 16, v24
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v24, v25, s27
+; GFX10-NEXT:    v_cndmask_b32_e64 v25, v22, v23, s26
+; GFX10-NEXT:    v_lshrrev_b32_e32 v23, 16, v23
+; GFX10-NEXT:    v_lshrrev_b32_e32 v22, 16, v22
+; GFX10-NEXT:    v_cndmask_b32_e64 v22, v22, v23, s25
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v20, v21, s24
+; GFX10-NEXT:    v_lshrrev_b32_e32 v21, 16, v21
+; GFX10-NEXT:    v_lshrrev_b32_e32 v20, 16, v20
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v20, v21, s23
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v18, v19, s22
 ; GFX10-NEXT:    v_lshrrev_b32_e32 v19, 16, v19
-; GFX10-NEXT:    v_cndmask_b32_e32 v11, v36, v37, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v7
-; GFX10-NEXT:    v_cndmask_b32_e32 v7, v49, v48, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v3
-; GFX10-NEXT:    v_cndmask_b32_e32 v3, v32, v33, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
-; GFX10-NEXT:    v_cndmask_b32_e32 v1, v19, v31, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v5
-; GFX10-NEXT:    v_perm_b32 v0, v1, v0, 0x5040100
-; GFX10-NEXT:    v_cndmask_b32_e32 v5, v34, v50, vcc_lo
-; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v9
-; GFX10-NEXT:    v_perm_b32 v1, v3, v2, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v3, v7, v6, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v6, v30, v12, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v2, v5, v4, 0x5040100
-; GFX10-NEXT:    v_cndmask_b32_e32 v9, v38, v39, vcc_lo
-; GFX10-NEXT:    v_perm_b32 v5, v11, v10, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v7, v15, v14, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v10, v21, v20, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v11, v22, v23, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v4, v9, v8, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v8, v17, v16, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v9, v13, v18, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v12, v24, v25, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v13, v26, v27, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v14, v28, v29, 0x5040100
-; GFX10-NEXT:    v_perm_b32 v15, v35, v54, 0x5040100
+; GFX10-NEXT:    v_lshrrev_b32_e32 v18, 16, v18
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v18, v19, s21
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v16, v17, s20
+; GFX10-NEXT:    v_lshrrev_b32_e32 v17, 16, v17
+; GFX10-NEXT:    v_lshrrev_b32_e32 v16, 16, v16
+; GFX10-NEXT:    v_cndmask_b32_e64 v16, v16, v17, s19
+; GFX10-NEXT:    v_cndmask_b32_e64 v17, v14, v15, s18
+; GFX10-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
+; GFX10-NEXT:    v_lshrrev_b32_e32 v14, 16, v14
+; GFX10-NEXT:    v_cndmask_b32_e64 v14, v14, v15, s17
+; GFX10-NEXT:    v_cndmask_b32_e64 v15, v12, v13, s16
+; GFX10-NEXT:    v_lshrrev_b32_e32 v13, 16, v13
+; GFX10-NEXT:    v_lshrrev_b32_e32 v12, 16, v12
+; GFX10-NEXT:    v_cndmask_b32_e64 v12, v12, v13, s15
+; GFX10-NEXT:    v_cndmask_b32_e64 v13, v10, v11, s14
+; GFX10-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
+; GFX10-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
+; GFX10-NEXT:    v_cndmask_b32_e64 v10, v10, v11, s13
+; GFX10-NEXT:    v_cndmask_b32_e64 v11, v8, v9, s12
+; GFX10-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
+; GFX10-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, v8, v9, s11
+; GFX10-NEXT:    v_cndmask_b32_e64 v9, v6, v7, s10
+; GFX10-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
+; GFX10-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v6, v7, s9
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v4, v5, s8
+; GFX10-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
+; GFX10-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v4, v5, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v2, v3, s6
+; GFX10-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
+; GFX10-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v2, v3, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v0, v1, s4
+; GFX10-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GFX10-NEXT:    v_lshrrev_b32_e32 v0, 16, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc_lo
+; GFX10-NEXT:    v_perm_b32 v1, v2, v5, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v2, v4, v7, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v4, v8, v11, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v5, v10, v13, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v0, v0, v3, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v3, v6, v9, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v6, v12, v15, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v7, v14, v17, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v8, v16, v19, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v9, v18, v21, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v10, v20, v23, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v11, v22, v25, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v12, v24, v27, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v13, v26, v29, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v14, v28, v31, 0x5040100
+; GFX10-NEXT:    v_perm_b32 v15, v30, v32, 0x5040100
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11TRUE16-LABEL: v_vselect_v32bf16:
@@ -45500,15 +45480,14 @@ define <4 x bfloat> @v_fmuladd_v4bf16(<4 x bfloat> %a, <4 x bfloat> %b, <4 x bfl
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11TRUE16-NEXT:    v_add3_u32 v7, v7, v3, 0x7fff
-; GFX11TRUE16-NEXT:    v_and_b32_e32 v1, 0xffff0000, v1
+; GFX11TRUE16-NEXT:    v_dual_cndmask_b32 v0, v9, v11 :: v_dual_and_b32 v1, 0xffff0000, v1
 ; GFX11TRUE16-NEXT:    v_and_b32_e32 v6, 0xffff0000, v6
-; GFX11TRUE16-NEXT:    v_cndmask_b32_e32 v0, v9, v11, vcc_lo
 ; GFX11TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11TRUE16-NEXT:    v_dual_add_f32 v1, v1, v5 :: v_dual_add_f32 v2, v6, v8
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11TRUE16-NEXT:    v_dual_add_f32 v1, v1, v5 :: v_dual_and_b32 v0, 0xffff0000, v0
+; GFX11TRUE16-NEXT:    v_add_f32_e32 v2, v6, v8
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v6, 0x400000, v3
-; GFX11TRUE16-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
-; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
 ; GFX11TRUE16-NEXT:    v_bfe_u32 v5, v1, 16, 1
 ; GFX11TRUE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
 ; GFX11TRUE16-NEXT:    v_dual_cndmask_b32 v3, v7, v6 :: v_dual_lshlrev_b32 v6, 16, v4
@@ -45559,14 +45538,14 @@ define <4 x bfloat> @v_fmuladd_v4bf16(<4 x bfloat> %a, <4 x bfloat> %b, <4 x bfl
 ; GFX11FAKE16-NEXT:    v_dual_mul_f32 v6, v7, v6 :: v_dual_and_b32 v5, 0xffff0000, v5
 ; GFX11FAKE16-NEXT:    v_lshlrev_b32_e32 v7, 16, v2
 ; GFX11FAKE16-NEXT:    v_dual_mul_f32 v1, v1, v3 :: v_dual_and_b32 v2, 0xffff0000, v2
-; GFX11FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
 ; GFX11FAKE16-NEXT:    v_bfe_u32 v10, v6, 16, 1
-; GFX11FAKE16-NEXT:    v_mul_f32_e32 v7, v9, v7
 ; GFX11FAKE16-NEXT:    v_or_b32_e32 v3, 0x400000, v6
 ; GFX11FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX11FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11FAKE16-NEXT:    v_mul_f32_e32 v7, v9, v7
 ; GFX11FAKE16-NEXT:    v_add3_u32 v10, v10, v6, 0x7fff
 ; GFX11FAKE16-NEXT:    v_or_b32_e32 v6, 0x400000, v1
+; GFX11FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX11FAKE16-NEXT:    v_bfe_u32 v9, v7, 16, 1
 ; GFX11FAKE16-NEXT:    v_dual_cndmask_b32 v3, v10, v3 :: v_dual_mul_f32 v0, v0, v2
 ; GFX11FAKE16-NEXT:    v_bfe_u32 v2, v1, 16, 1
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
index e7f48435f0ad2..272765c317103 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
@@ -8344,11 +8344,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fadd_ret_v2bf16__offset__amdgpu
 ; GFX11-NEXT:    v_dual_mov_b32 v1, v0 :: v_dual_mov_b32 v0, s16
 ; GFX11-NEXT:    s_add_i32 s4, s16, 0x400
 ; GFX11-NEXT:    s_mov_b32 s5, 0
-; GFX11-NEXT:    v_mov_b32_e32 v4, s4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX11-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen offset:1024
-; GFX11-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX11-NEXT:    s_set_inst_prefetch_distance 0x1
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB26_1: ; %atomicrmw.start
@@ -9761,11 +9760,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fadd_ret_v2bf16__offset(ptr add
 ; GFX11-NEXT:    v_dual_mov_b32 v1, v0 :: v_dual_mov_b32 v0, s16
 ; GFX11-NEXT:    s_add_i32 s4, s16, 0x400
 ; GFX11-NEXT:    s_mov_b32 s5, 0
-; GFX11-NEXT:    v_mov_b32_e32 v4, s4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX11-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen offset:1024
-; GFX11-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX11-NEXT:    s_set_inst_prefetch_distance 0x1
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB29_1: ; %atomicrmw.start
@@ -10520,11 +10518,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fadd_ret_v2bf16__offset__amdgpu
 ; GFX11-NEXT:    v_dual_mov_b32 v1, v0 :: v_dual_mov_b32 v0, s16
 ; GFX11-NEXT:    s_add_i32 s4, s16, 0x400
 ; GFX11-NEXT:    s_mov_b32 s5, 0
-; GFX11-NEXT:    v_mov_b32_e32 v4, s4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX11-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen offset:1024
-; GFX11-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX11-NEXT:    s_set_inst_prefetch_distance 0x1
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB31_1: ; %atomicrmw.start
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
index b0447194412d8..5c7aa447116dd 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
@@ -6625,10 +6625,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fmax_ret_v2bf16__offset__amdgpu
 ; GFX12-NEXT:    s_add_co_i32 s4, s16, 0x400
 ; GFX12-NEXT:    s_mov_b32 s5, 0
 ; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    v_mov_b32_e32 v4, s4
-; GFX12-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX12-NEXT:    buffer_load_b32 v0, v0, s[0:3], null offen offset:1024
-; GFX12-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX12-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX12-NEXT:  .LBB19_1: ; %atomicrmw.start
 ; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
@@ -6721,11 +6721,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fmax_ret_v2bf16__offset__amdgpu
 ; GFX11-NEXT:    v_dual_mov_b32 v1, v0 :: v_dual_mov_b32 v0, s16
 ; GFX11-NEXT:    s_add_i32 s4, s16, 0x400
 ; GFX11-NEXT:    s_mov_b32 s5, 0
-; GFX11-NEXT:    v_mov_b32_e32 v4, s4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX11-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen offset:1024
-; GFX11-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX11-NEXT:    s_set_inst_prefetch_distance 0x1
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB19_1: ; %atomicrmw.start
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
index e33c8aa30391d..33937aaa3b06c 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
@@ -6625,10 +6625,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fmin_ret_v2bf16__offset__amdgpu
 ; GFX12-NEXT:    s_add_co_i32 s4, s16, 0x400
 ; GFX12-NEXT:    s_mov_b32 s5, 0
 ; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    v_mov_b32_e32 v4, s4
-; GFX12-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX12-NEXT:    buffer_load_b32 v0, v0, s[0:3], null offen offset:1024
-; GFX12-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX12-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX12-NEXT:  .LBB19_1: ; %atomicrmw.start
 ; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
@@ -6721,11 +6721,10 @@ define <2 x bfloat> @buffer_fat_ptr_agent_atomic_fmin_ret_v2bf16__offset__amdgpu
 ; GFX11-NEXT:    v_dual_mov_b32 v1, v0 :: v_dual_mov_b32 v0, s16
 ; GFX11-NEXT:    s_add_i32 s4, s16, 0x400
 ; GFX11-NEXT:    s_mov_b32 s5, 0
-; GFX11-NEXT:    v_mov_b32_e32 v4, s4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_and_b32 v3, 0xffff0000, v1
 ; GFX11-NEXT:    buffer_load_b32 v0, v0, s[0:3], 0 offen offset:1024
-; GFX11-NEXT:    v_and_b32_e32 v3, 0xffff0000, v1
+; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 16, v1
 ; GFX11-NEXT:    s_set_inst_prefetch_distance 0x1
 ; GFX11-NEXT:    .p2align 6
 ; GFX11-NEXT:  .LBB19_1: ; %atomicrmw.start
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll
index ffa9b465af0dd..c3b0ecff3f723 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll
@@ -1254,12 +1254,12 @@ define amdgpu_kernel void @memcpy_known_small(ptr addrspace(7) inreg %src, ptr a
 ; SDAG-GFX1100-NEXT:    s_load_b128 s[0:3], s[4:5], 0x44
 ; SDAG-GFX1100-NEXT:    s_mov_b32 s5, s12
 ; SDAG-GFX1100-NEXT:    s_waitcnt lgkmcnt(0)
-; SDAG-GFX1100-NEXT:    v_mov_b32_e32 v5, s0
 ; SDAG-GFX1100-NEXT:    s_mov_b32 s4, s3
-; SDAG-GFX1100-NEXT:    s_mov_b32 s3, s12
+; SDAG-GFX1100-NEXT:    v_mov_b32_e32 v5, s0
 ; SDAG-GFX1100-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; SDAG-GFX1100-NEXT:    s_mov_b32 s13, s2
 ; SDAG-GFX1100-NEXT:    s_mov_b32 s2, s1
+; SDAG-GFX1100-NEXT:    s_mov_b32 s3, s12
 ; SDAG-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; SDAG-GFX1100-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; SDAG-GFX1100-NEXT:    s_waitcnt vmcnt(0)
@@ -1325,12 +1325,12 @@ define amdgpu_kernel void @memcpy_known_small(ptr addrspace(7) inreg %src, ptr a
 ; GISEL-GFX1100-NEXT:    s_load_b32 s7, s[4:5], 0x54
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s4, s13
 ; GISEL-GFX1100-NEXT:    s_waitcnt lgkmcnt(0)
-; GISEL-GFX1100-NEXT:    v_mov_b32_e32 v5, s8
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s12, s9
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s5, s10
-; GISEL-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GISEL-GFX1100-NEXT:    v_mov_b32_e32 v5, s8
 ; GISEL-GFX1100-NEXT:    s_or_b64 s[4:5], s[12:13], s[4:5]
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s12, s11
+; GISEL-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GISEL-GFX1100-NEXT:    s_or_b64 s[6:7], s[12:13], s[6:7]
 ; GISEL-GFX1100-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-GFX1100-NEXT:    buffer_store_b128 v[0:3], v5, s[4:7], 0 offen
diff --git a/llvm/test/CodeGen/AMDGPU/call-argument-types.ll b/llvm/test/CodeGen/AMDGPU/call-argument-types.ll
index 178b138b57141..acf2f8add7670 100644
--- a/llvm/test/CodeGen/AMDGPU/call-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/call-argument-types.ll
@@ -6064,8 +6064,8 @@ define void @stack_12xv3i32() #0 {
 ; GFX11-NEXT:    s_add_i32 s0, s32, 16
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s32
 ; GFX11-NEXT:    scratch_store_b32 off, v4, s0
-; GFX11-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v0, 0
-; GFX11-NEXT:    v_dual_mov_b32 v3, 1 :: v_dual_mov_b32 v2, 0
+; GFX11-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v3, 1
+; GFX11-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v2, 0
 ; GFX11-NEXT:    v_dual_mov_b32 v5, 1 :: v_dual_mov_b32 v4, 1
 ; GFX11-NEXT:    v_dual_mov_b32 v7, 2 :: v_dual_mov_b32 v6, 2
 ; GFX11-NEXT:    v_dual_mov_b32 v9, 3 :: v_dual_mov_b32 v8, 2
@@ -6772,10 +6772,10 @@ define void @stack_8xv5i32() #0 {
 ; GFX11-NEXT:    s_add_i32 s1, s32, 16
 ; GFX11-NEXT:    v_writelane_b32 v40, s30, 0
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s32
+; GFX11-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v3, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    scratch_store_b32 off, v8, s0
 ; GFX11-NEXT:    scratch_store_b128 off, v[4:7], s1
-; GFX11-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v3, 0
 ; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v5, 1
 ; GFX11-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v7, 1
 ; GFX11-NEXT:    v_dual_mov_b32 v6, 1 :: v_dual_mov_b32 v9, 1
@@ -7150,21 +7150,21 @@ define void @stack_8xv5f32() #0 {
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s32
 ; GFX11-NEXT:    scratch_store_b32 off, v8, s0
 ; GFX11-NEXT:    scratch_store_b128 off, v[4:7], s1
-; GFX11-NEXT:    v_mov_b32_e32 v6, 1.0
 ; GFX11-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0
 ; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v3, 0
 ; GFX11-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v5, 1.0
-; GFX11-NEXT:    v_dual_mov_b32 v7, 1.0 :: v_dual_mov_b32 v8, 1.0
-; GFX11-NEXT:    v_dual_mov_b32 v11, 2.0 :: v_dual_mov_b32 v10, 2.0
-; GFX11-NEXT:    v_dual_mov_b32 v13, 2.0 :: v_dual_mov_b32 v12, 2.0
-; GFX11-NEXT:    v_dual_mov_b32 v15, 0x40400000 :: v_dual_mov_b32 v14, 2.0
-; GFX11-NEXT:    v_dual_mov_b32 v17, 0x40400000 :: v_dual_mov_b32 v16, 0x40400000
-; GFX11-NEXT:    v_dual_mov_b32 v19, 0x40400000 :: v_dual_mov_b32 v18, 0x40400000
-; GFX11-NEXT:    v_dual_mov_b32 v21, 4.0 :: v_dual_mov_b32 v20, 4.0
-; GFX11-NEXT:    v_dual_mov_b32 v23, 4.0 :: v_dual_mov_b32 v22, 4.0
-; GFX11-NEXT:    v_dual_mov_b32 v25, 0x40a00000 :: v_dual_mov_b32 v24, 4.0
-; GFX11-NEXT:    v_dual_mov_b32 v27, 0x40a00000 :: v_dual_mov_b32 v26, 0x40a00000
-; GFX11-NEXT:    v_dual_mov_b32 v29, 0x40a00000 :: v_dual_mov_b32 v28, 0x40a00000
+; GFX11-NEXT:    v_dual_mov_b32 v6, 1.0 :: v_dual_mov_b32 v7, 1.0
+; GFX11-NEXT:    v_dual_mov_b32 v8, 1.0 :: v_dual_mov_b32 v11, 2.0
+; GFX11-NEXT:    v_dual_mov_b32 v10, 2.0 :: v_dual_mov_b32 v13, 2.0
+; GFX11-NEXT:    v_dual_mov_b32 v12, 2.0 :: v_dual_mov_b32 v15, 0x40400000
+; GFX11-NEXT:    v_dual_mov_b32 v14, 2.0 :: v_dual_mov_b32 v17, 0x40400000
+; GFX11-NEXT:    v_dual_mov_b32 v16, 0x40400000 :: v_dual_mov_b32 v19, 0x40400000
+; GFX11-NEXT:    v_dual_mov_b32 v18, 0x40400000 :: v_dual_mov_b32 v21, 4.0
+; GFX11-NEXT:    v_dual_mov_b32 v20, 4.0 :: v_dual_mov_b32 v23, 4.0
+; GFX11-NEXT:    v_dual_mov_b32 v22, 4.0 :: v_dual_mov_b32 v25, 0x40a00000
+; GFX11-NEXT:    v_dual_mov_b32 v24, 4.0 :: v_dual_mov_b32 v27, 0x40a00000
+; GFX11-NEXT:    v_dual_mov_b32 v26, 0x40a00000 :: v_dual_mov_b32 v29, 0x40a00000
+; GFX11-NEXT:    v_mov_b32_e32 v28, 0x40a00000
 ; GFX11-NEXT:    v_mov_b32_e32 v30, 0x40c00000
 ; GFX11-NEXT:    s_getpc_b64 s[0:1]
 ; GFX11-NEXT:    s_add_u32 s0, s0, external_void_func_8xv5f32@rel32@lo+4
diff --git a/llvm/test/CodeGen/AMDGPU/carryout-selection.ll b/llvm/test/CodeGen/AMDGPU/carryout-selection.ll
index aabcd69c88ca3..d0ae30f813a72 100644
--- a/llvm/test/CodeGen/AMDGPU/carryout-selection.ll
+++ b/llvm/test/CodeGen/AMDGPU/carryout-selection.ll
@@ -2847,11 +2847,10 @@ define amdgpu_kernel void @sudiv64(ptr addrspace(1) %out, i64 %x, i64 %y) {
 ; GFX11-NEXT:    s_addc_u32 s12, s5, 0
 ; GFX11-NEXT:    s_add_u32 s13, s1, 2
 ; GFX11-NEXT:    s_addc_u32 s14, s5, 0
-; GFX11-NEXT:    v_mov_b32_e32 v2, s13
 ; GFX11-NEXT:    s_cmp_lg_u32 s7, 0
 ; GFX11-NEXT:    v_cmp_le_u32_e32 vcc_lo, s2, v0
 ; GFX11-NEXT:    s_subb_u32 s0, s11, s0
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    v_mov_b32_e32 v2, s13
 ; GFX11-NEXT:    s_cmp_ge_u32 s0, s3
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, 0, -1, vcc_lo
 ; GFX11-NEXT:    s_cselect_b32 s7, -1, 0
diff --git a/llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll b/llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll
index 4fe11760e71fd..7eb69324d1bc3 100644
--- a/llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll
+++ b/llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll
@@ -912,38 +912,36 @@ define amdgpu_kernel void @v_ctlz_zero_undef_i64_with_select(ptr addrspace(1) no
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b32 s8, s6
 ; SI-NEXT:    s_mov_b32 s9, s7
-; SI-NEXT:    buffer_load_ubyte v0, off, s[8:11], 0 offset:5
-; SI-NEXT:    buffer_load_ubyte v1, off, s[8:11], 0 offset:7
-; SI-NEXT:    buffer_load_ubyte v2, off, s[8:11], 0
-; SI-NEXT:    buffer_load_ubyte v3, off, s[8:11], 0 offset:1
-; SI-NEXT:    buffer_load_ubyte v4, off, s[8:11], 0 offset:2
-; SI-NEXT:    buffer_load_ubyte v5, off, s[8:11], 0 offset:3
+; SI-NEXT:    buffer_load_ubyte v0, off, s[8:11], 0
+; SI-NEXT:    buffer_load_ubyte v1, off, s[8:11], 0 offset:1
+; SI-NEXT:    buffer_load_ubyte v2, off, s[8:11], 0 offset:2
+; SI-NEXT:    buffer_load_ubyte v3, off, s[8:11], 0 offset:3
+; SI-NEXT:    buffer_load_ubyte v4, off, s[8:11], 0 offset:5
+; SI-NEXT:    buffer_load_ubyte v5, off, s[8:11], 0 offset:7
 ; SI-NEXT:    buffer_load_ubyte v6, off, s[8:11], 0 offset:4
 ; SI-NEXT:    buffer_load_ubyte v7, off, s[8:11], 0 offset:6
 ; SI-NEXT:    s_mov_b32 s0, s4
 ; SI-NEXT:    s_mov_b32 s1, s5
-; SI-NEXT:    s_waitcnt vmcnt(7)
-; SI-NEXT:    v_lshlrev_b32_e32 v0, 8, v0
-; SI-NEXT:    s_waitcnt vmcnt(6)
-; SI-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; SI-NEXT:    s_waitcnt vmcnt(4)
-; SI-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; SI-NEXT:    s_waitcnt vmcnt(3)
+; SI-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
 ; SI-NEXT:    s_waitcnt vmcnt(2)
 ; SI-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; SI-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; SI-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; SI-NEXT:    s_waitcnt vmcnt(1)
-; SI-NEXT:    v_or_b32_e32 v0, v0, v6
+; SI-NEXT:    v_or_b32_e32 v4, v4, v6
 ; SI-NEXT:    s_waitcnt vmcnt(0)
-; SI-NEXT:    v_or_b32_e32 v1, v1, v7
-; SI-NEXT:    v_or_b32_e32 v2, v3, v2
-; SI-NEXT:    v_or_b32_e32 v3, v5, v4
-; SI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; SI-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; SI-NEXT:    v_or_b32_e32 v5, v5, v7
 ; SI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; SI-NEXT:    v_or_b32_e32 v1, v3, v2
-; SI-NEXT:    v_ffbh_u32_e32 v1, v1
+; SI-NEXT:    v_lshlrev_b32_e32 v2, 16, v5
+; SI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; SI-NEXT:    v_or_b32_e32 v2, v2, v4
+; SI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; SI-NEXT:    v_ffbh_u32_e32 v0, v0
-; SI-NEXT:    v_add_i32_e32 v1, vcc, 32, v1
-; SI-NEXT:    v_min_u32_e32 v0, v1, v0
+; SI-NEXT:    v_ffbh_u32_e32 v1, v2
+; SI-NEXT:    v_add_i32_e32 v0, vcc, 32, v0
+; SI-NEXT:    v_min_u32_e32 v0, v0, v1
 ; SI-NEXT:    v_min_u32_e32 v0, 64, v0
 ; SI-NEXT:    v_mov_b32_e32 v1, 0
 ; SI-NEXT:    buffer_store_dwordx2 v[0:1], off, s[0:3], 0
@@ -974,46 +972,46 @@ define amdgpu_kernel void @v_ctlz_zero_undef_i64_with_select(ptr addrspace(1) no
 ; VI-NEXT:    v_mov_b32_e32 v9, s5
 ; VI-NEXT:    v_mov_b32_e32 v8, s4
 ; VI-NEXT:    s_add_u32 s4, s2, 2
-; VI-NEXT:    s_addc_u32 s5, s3, 0
-; VI-NEXT:    v_mov_b32_e32 v11, s5
-; VI-NEXT:    v_mov_b32_e32 v10, s4
-; VI-NEXT:    s_add_u32 s4, s2, 1
-; VI-NEXT:    flat_load_ubyte v12, v[0:1]
-; VI-NEXT:    flat_load_ubyte v13, v[2:3]
-; VI-NEXT:    flat_load_ubyte v4, v[4:5]
-; VI-NEXT:    flat_load_ubyte v5, v[6:7]
+; VI-NEXT:    flat_load_ubyte v10, v[0:1]
+; VI-NEXT:    flat_load_ubyte v11, v[2:3]
+; VI-NEXT:    flat_load_ubyte v12, v[4:5]
+; VI-NEXT:    flat_load_ubyte v6, v[6:7]
+; VI-NEXT:    flat_load_ubyte v7, v[8:9]
 ; VI-NEXT:    s_addc_u32 s5, s3, 0
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
-; VI-NEXT:    flat_load_ubyte v6, v[8:9]
-; VI-NEXT:    v_mov_b32_e32 v2, s2
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
-; VI-NEXT:    v_mov_b32_e32 v3, s3
-; VI-NEXT:    flat_load_ubyte v7, v[10:11]
+; VI-NEXT:    s_add_u32 s4, s2, 1
+; VI-NEXT:    s_addc_u32 s5, s3, 0
+; VI-NEXT:    v_mov_b32_e32 v2, s4
+; VI-NEXT:    v_mov_b32_e32 v3, s5
+; VI-NEXT:    v_mov_b32_e32 v5, s3
+; VI-NEXT:    v_mov_b32_e32 v4, s2
 ; VI-NEXT:    flat_load_ubyte v0, v[0:1]
 ; VI-NEXT:    flat_load_ubyte v2, v[2:3]
+; VI-NEXT:    flat_load_ubyte v3, v[4:5]
 ; VI-NEXT:    v_mov_b32_e32 v1, 0
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b32_e32 v3, 8, v12
+; VI-NEXT:    v_lshlrev_b32_e32 v4, 8, v10
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_e32 v3, v3, v13
+; VI-NEXT:    v_or_b32_e32 v4, v4, v11
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
+; VI-NEXT:    v_lshlrev_b32_e32 v5, 8, v12
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_e32 v3, v4, v3
-; VI-NEXT:    v_ffbh_u32_e32 v3, v3
+; VI-NEXT:    v_or_b32_sdwa v5, v5, v6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_e32 v4, v5, v4
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b32_e32 v4, 8, v6
+; VI-NEXT:    v_lshlrev_b32_e32 v5, 8, v7
+; VI-NEXT:    v_ffbh_u32_e32 v4, v4
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v7 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v0, v5, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b32_e32 v0, 8, v0
+; VI-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_e32 v2, v2, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v0, v2
-; VI-NEXT:    v_or_b32_e32 v0, v4, v0
 ; VI-NEXT:    v_ffbh_u32_e32 v0, v0
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, 32, v0
-; VI-NEXT:    v_min_u32_e32 v0, v0, v3
+; VI-NEXT:    v_min_u32_e32 v0, v0, v4
 ; VI-NEXT:    v_mov_b32_e32 v3, s1
 ; VI-NEXT:    v_min_u32_e32 v0, 64, v0
 ; VI-NEXT:    v_mov_b32_e32 v2, s0
diff --git a/llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll b/llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll
index 9fcfbba6fb235..3005c467aa796 100644
--- a/llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll
+++ b/llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll
@@ -876,38 +876,36 @@ define amdgpu_kernel void @v_cttz_zero_undef_i64_with_select(ptr addrspace(1) no
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b32 s8, s6
 ; SI-NEXT:    s_mov_b32 s9, s7
-; SI-NEXT:    buffer_load_ubyte v0, off, s[8:11], 0 offset:5
-; SI-NEXT:    buffer_load_ubyte v1, off, s[8:11], 0 offset:7
-; SI-NEXT:    buffer_load_ubyte v2, off, s[8:11], 0
-; SI-NEXT:    buffer_load_ubyte v3, off, s[8:11], 0 offset:1
-; SI-NEXT:    buffer_load_ubyte v4, off, s[8:11], 0 offset:2
-; SI-NEXT:    buffer_load_ubyte v5, off, s[8:11], 0 offset:3
+; SI-NEXT:    buffer_load_ubyte v0, off, s[8:11], 0
+; SI-NEXT:    buffer_load_ubyte v1, off, s[8:11], 0 offset:1
+; SI-NEXT:    buffer_load_ubyte v2, off, s[8:11], 0 offset:2
+; SI-NEXT:    buffer_load_ubyte v3, off, s[8:11], 0 offset:3
+; SI-NEXT:    buffer_load_ubyte v4, off, s[8:11], 0 offset:5
+; SI-NEXT:    buffer_load_ubyte v5, off, s[8:11], 0 offset:7
 ; SI-NEXT:    buffer_load_ubyte v6, off, s[8:11], 0 offset:4
 ; SI-NEXT:    buffer_load_ubyte v7, off, s[8:11], 0 offset:6
 ; SI-NEXT:    s_mov_b32 s0, s4
 ; SI-NEXT:    s_mov_b32 s1, s5
-; SI-NEXT:    s_waitcnt vmcnt(7)
-; SI-NEXT:    v_lshlrev_b32_e32 v0, 8, v0
-; SI-NEXT:    s_waitcnt vmcnt(6)
-; SI-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
-; SI-NEXT:    s_waitcnt vmcnt(4)
-; SI-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; SI-NEXT:    s_waitcnt vmcnt(3)
+; SI-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
 ; SI-NEXT:    s_waitcnt vmcnt(2)
 ; SI-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
+; SI-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; SI-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
 ; SI-NEXT:    s_waitcnt vmcnt(1)
-; SI-NEXT:    v_or_b32_e32 v0, v0, v6
+; SI-NEXT:    v_or_b32_e32 v4, v4, v6
 ; SI-NEXT:    s_waitcnt vmcnt(0)
-; SI-NEXT:    v_or_b32_e32 v1, v1, v7
-; SI-NEXT:    v_or_b32_e32 v2, v3, v2
-; SI-NEXT:    v_or_b32_e32 v3, v5, v4
-; SI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; SI-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; SI-NEXT:    v_or_b32_e32 v5, v5, v7
 ; SI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; SI-NEXT:    v_or_b32_e32 v1, v3, v2
-; SI-NEXT:    v_ffbl_b32_e32 v1, v1
+; SI-NEXT:    v_lshlrev_b32_e32 v2, 16, v5
+; SI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; SI-NEXT:    v_or_b32_e32 v2, v2, v4
+; SI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; SI-NEXT:    v_ffbl_b32_e32 v0, v0
-; SI-NEXT:    v_add_i32_e32 v0, vcc, 32, v0
-; SI-NEXT:    v_min_u32_e32 v0, v0, v1
+; SI-NEXT:    v_ffbl_b32_e32 v1, v2
+; SI-NEXT:    v_add_i32_e32 v1, vcc, 32, v1
+; SI-NEXT:    v_min_u32_e32 v0, v1, v0
 ; SI-NEXT:    v_min_u32_e32 v0, 64, v0
 ; SI-NEXT:    v_mov_b32_e32 v1, 0
 ; SI-NEXT:    buffer_store_dwordx2 v[0:1], off, s[0:3], 0
@@ -938,46 +936,46 @@ define amdgpu_kernel void @v_cttz_zero_undef_i64_with_select(ptr addrspace(1) no
 ; VI-NEXT:    v_mov_b32_e32 v9, s5
 ; VI-NEXT:    v_mov_b32_e32 v8, s4
 ; VI-NEXT:    s_add_u32 s4, s2, 2
-; VI-NEXT:    s_addc_u32 s5, s3, 0
-; VI-NEXT:    v_mov_b32_e32 v11, s5
-; VI-NEXT:    v_mov_b32_e32 v10, s4
-; VI-NEXT:    flat_load_ubyte v12, v[0:1]
-; VI-NEXT:    flat_load_ubyte v13, v[2:3]
-; VI-NEXT:    flat_load_ubyte v4, v[4:5]
-; VI-NEXT:    flat_load_ubyte v5, v[6:7]
-; VI-NEXT:    s_add_u32 s4, s2, 1
-; VI-NEXT:    flat_load_ubyte v6, v[8:9]
+; VI-NEXT:    flat_load_ubyte v10, v[0:1]
+; VI-NEXT:    flat_load_ubyte v11, v[2:3]
+; VI-NEXT:    flat_load_ubyte v12, v[4:5]
+; VI-NEXT:    flat_load_ubyte v6, v[6:7]
+; VI-NEXT:    flat_load_ubyte v7, v[8:9]
 ; VI-NEXT:    s_addc_u32 s5, s3, 0
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
-; VI-NEXT:    v_mov_b32_e32 v2, s2
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
-; VI-NEXT:    v_mov_b32_e32 v3, s3
-; VI-NEXT:    flat_load_ubyte v7, v[10:11]
+; VI-NEXT:    s_add_u32 s4, s2, 1
+; VI-NEXT:    s_addc_u32 s5, s3, 0
+; VI-NEXT:    v_mov_b32_e32 v2, s4
+; VI-NEXT:    v_mov_b32_e32 v3, s5
+; VI-NEXT:    v_mov_b32_e32 v5, s3
+; VI-NEXT:    v_mov_b32_e32 v4, s2
 ; VI-NEXT:    flat_load_ubyte v0, v[0:1]
 ; VI-NEXT:    flat_load_ubyte v2, v[2:3]
+; VI-NEXT:    flat_load_ubyte v3, v[4:5]
 ; VI-NEXT:    v_mov_b32_e32 v1, 0
 ; VI-NEXT:    s_waitcnt vmcnt(7)
-; VI-NEXT:    v_lshlrev_b32_e32 v3, 8, v12
+; VI-NEXT:    v_lshlrev_b32_e32 v4, 8, v10
 ; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_or_b32_e32 v3, v3, v13
+; VI-NEXT:    v_or_b32_e32 v4, v4, v11
 ; VI-NEXT:    s_waitcnt vmcnt(5)
-; VI-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
+; VI-NEXT:    v_lshlrev_b32_e32 v5, 8, v12
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; VI-NEXT:    v_or_b32_e32 v3, v4, v3
+; VI-NEXT:    v_or_b32_sdwa v5, v5, v6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_e32 v4, v5, v4
 ; VI-NEXT:    s_waitcnt vmcnt(3)
-; VI-NEXT:    v_lshlrev_b32_e32 v4, 8, v6
-; VI-NEXT:    v_ffbl_b32_e32 v3, v3
-; VI-NEXT:    v_add_u32_e32 v3, vcc, 32, v3
+; VI-NEXT:    v_lshlrev_b32_e32 v5, 8, v7
+; VI-NEXT:    v_ffbl_b32_e32 v4, v4
+; VI-NEXT:    v_add_u32_e32 v4, vcc, 32, v4
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_or_b32_sdwa v4, v4, v7 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT:    v_or_b32_sdwa v0, v5, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_lshlrev_b32_e32 v0, 8, v0
+; VI-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_or_b32_e32 v2, v2, v3
 ; VI-NEXT:    v_or_b32_e32 v0, v0, v2
-; VI-NEXT:    v_or_b32_e32 v0, v4, v0
 ; VI-NEXT:    v_ffbl_b32_e32 v0, v0
-; VI-NEXT:    v_min_u32_e32 v0, v3, v0
+; VI-NEXT:    v_min_u32_e32 v0, v4, v0
 ; VI-NEXT:    v_mov_b32_e32 v3, s1
 ; VI-NEXT:    v_min_u32_e32 v0, 64, v0
 ; VI-NEXT:    v_mov_b32_e32 v2, s0
diff --git a/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll b/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
index d1090738e24a6..745e047348626 100644
--- a/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
+++ b/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
@@ -1568,10 +1568,10 @@ define amdgpu_kernel void @load_v4i8_to_v4f32_unaligned_multiuse(ptr addrspace(1
 ; GFX9-NEXT:    s_mov_b32 s0, 0x4000405
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    global_load_ubyte v1, v0, s[12:13] offset:2
-; GFX9-NEXT:    global_load_ubyte v2, v0, s[14:15] offset:3
 ; GFX9-NEXT:    global_load_ubyte v3, v0, s[12:13] offset:3
+; GFX9-NEXT:    global_load_ubyte v2, v0, s[14:15] offset:3
 ; GFX9-NEXT:    global_load_ubyte v4, v0, s[14:15] offset:2
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
 ; GFX9-NEXT:    v_lshl_or_b32 v6, v3, 8, v1
 ; GFX9-NEXT:    v_cvt_f32_ubyte0_e32 v1, v1
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -1904,37 +1904,37 @@ define amdgpu_kernel void @load_v7i8_to_v7f32(ptr addrspace(1) noalias %out, ptr
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
 ; VI-NEXT:    v_add_u32_e32 v2, vcc, 5, v0
 ; VI-NEXT:    v_addc_u32_e32 v3, vcc, 0, v1, vcc
-; VI-NEXT:    flat_load_ubyte v10, v[2:3]
-; VI-NEXT:    v_add_u32_e32 v2, vcc, 6, v0
-; VI-NEXT:    v_addc_u32_e32 v3, vcc, 0, v1, vcc
-; VI-NEXT:    v_add_u32_e32 v4, vcc, 1, v0
+; VI-NEXT:    v_add_u32_e32 v4, vcc, 6, v0
 ; VI-NEXT:    v_addc_u32_e32 v5, vcc, 0, v1, vcc
-; VI-NEXT:    v_add_u32_e32 v6, vcc, 2, v0
+; VI-NEXT:    v_add_u32_e32 v6, vcc, 1, v0
 ; VI-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
-; VI-NEXT:    v_add_u32_e32 v8, vcc, 3, v0
+; VI-NEXT:    v_add_u32_e32 v8, vcc, 2, v0
 ; VI-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
-; VI-NEXT:    flat_load_ubyte v6, v[6:7]
-; VI-NEXT:    flat_load_ubyte v7, v[8:9]
-; VI-NEXT:    flat_load_ubyte v8, v[2:3]
-; VI-NEXT:    flat_load_ubyte v2, v[0:1]
+; VI-NEXT:    v_add_u32_e32 v10, vcc, 3, v0
+; VI-NEXT:    v_addc_u32_e32 v11, vcc, 0, v1, vcc
+; VI-NEXT:    flat_load_ubyte v12, v[2:3]
+; VI-NEXT:    flat_load_ubyte v2, v[8:9]
+; VI-NEXT:    flat_load_ubyte v3, v[10:11]
 ; VI-NEXT:    flat_load_ubyte v4, v[4:5]
+; VI-NEXT:    flat_load_ubyte v5, v[0:1]
+; VI-NEXT:    flat_load_ubyte v6, v[6:7]
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, 4, v0
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT:    flat_load_ubyte v9, v[0:1]
+; VI-NEXT:    flat_load_ubyte v7, v[0:1]
 ; VI-NEXT:    s_mov_b32 s3, 0xf000
 ; VI-NEXT:    s_mov_b32 s2, -1
-; VI-NEXT:    s_waitcnt vmcnt(6)
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v5, v10
+; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v2, v2
 ; VI-NEXT:    s_waitcnt vmcnt(4)
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v3, v7
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v3, v3
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v0, v2
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v2, v6
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v0, v5
 ; VI-NEXT:    s_waitcnt vmcnt(1)
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v1, v4
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v6, v8
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v1, v6
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v6, v4
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v5, v12
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_cvt_f32_ubyte0_e32 v4, v9
+; VI-NEXT:    v_cvt_f32_ubyte0_e32 v4, v7
 ; VI-NEXT:    buffer_store_dwordx3 v[4:6], off, s[0:3], 0 offset:16
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[0:3], 0
 ; VI-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/ds-alignment.ll b/llvm/test/CodeGen/AMDGPU/ds-alignment.ll
index b1664c59a7e4c..93422e259b827 100644
--- a/llvm/test/CodeGen/AMDGPU/ds-alignment.ll
+++ b/llvm/test/CodeGen/AMDGPU/ds-alignment.ll
@@ -209,27 +209,28 @@ define amdgpu_kernel void @ds8align1(ptr addrspace(3) %in, ptr addrspace(3) %out
 ; ALIGNED-SDAG-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v0, s0
-; ALIGNED-SDAG-NEXT:    ds_read_u8 v1, v0
-; ALIGNED-SDAG-NEXT:    ds_read_u8 v2, v0 offset:1
-; ALIGNED-SDAG-NEXT:    ds_read_u8 v3, v0 offset:2
-; ALIGNED-SDAG-NEXT:    ds_read_u8 v4, v0 offset:3
+; ALIGNED-SDAG-NEXT:    ds_read_u8 v2, v0
+; ALIGNED-SDAG-NEXT:    ds_read_u8 v3, v0 offset:1
+; ALIGNED-SDAG-NEXT:    ds_read_u8 v4, v0 offset:2
 ; ALIGNED-SDAG-NEXT:    ds_read_u8 v5, v0 offset:4
 ; ALIGNED-SDAG-NEXT:    ds_read_u8 v6, v0 offset:5
+; ALIGNED-SDAG-NEXT:    ds_read_u8 v7, v0 offset:3
 ; ALIGNED-SDAG-NEXT:    ds_read_u8 v8, v0 offset:6
 ; ALIGNED-SDAG-NEXT:    ds_read_u8 v0, v0 offset:7
-; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v7, s1
-; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(3)
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v5 offset:4
-; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(3)
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v6 offset:5
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v1
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v2 offset:1
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v3 offset:2
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v4 offset:3
+; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v1, s1
+; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(4)
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v5 offset:4
+; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(4)
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v6 offset:5
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v2
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v3 offset:1
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v4 offset:2
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v8 offset:6
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v7 offset:3
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
-; ALIGNED-SDAG-NEXT:    ds_write_b8 v7, v0 offset:7
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v8 offset:6
+; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
+; ALIGNED-SDAG-NEXT:    ds_write_b8 v1, v0 offset:7
 ; ALIGNED-SDAG-NEXT:    s_endpgm
 ;
 ; ALIGNED-GISEL-LABEL: ds8align1:
@@ -492,23 +493,24 @@ define amdgpu_kernel void @ds12align2(ptr addrspace(3) %in, ptr addrspace(3) %ou
 ; ALIGNED-SDAG-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v0, s0
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v1, v0 offset:8
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v2, v0
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v3, v0 offset:2
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v3, v0 offset:8
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v4, v0 offset:4
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v5, v0 offset:6
-; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v6, s1
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v5, v0 offset:2
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v6, v0 offset:6
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v0, v0 offset:10
+; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v1, s1
+; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(4)
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v3 offset:8
+; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(4)
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v4 offset:4
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v2
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(5)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v6, v1 offset:8
-; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(3)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v6, v4 offset:4
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v6, v2
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v6, v3 offset:2
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v5 offset:2
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(5)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v6, v5 offset:6
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v6 offset:6
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(5)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v6, v0 offset:10
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v0 offset:10
 ; ALIGNED-SDAG-NEXT:    s_endpgm
 ;
 ; ALIGNED-GISEL-LABEL: ds12align2:
@@ -808,29 +810,27 @@ define amdgpu_kernel void @ds16align2(ptr addrspace(3) %in, ptr addrspace(3) %ou
 ; ALIGNED-SDAG-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v0, s0
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v1, v0 offset:12
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v2, v0
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v3, v0 offset:2
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v4, v0 offset:4
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v5, v0 offset:6
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v6, v0 offset:8
-; ALIGNED-SDAG-NEXT:    ds_read_u16 v7, v0 offset:10
-; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v8, s1
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v5, v0 offset:12
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v6, v0 offset:6
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v7, v0 offset:8
+; ALIGNED-SDAG-NEXT:    ds_read_u16 v8, v0 offset:10
 ; ALIGNED-SDAG-NEXT:    ds_read_u16 v0, v0 offset:14
-; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v1 offset:12
-; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v2
-; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(6)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v4 offset:4
+; ALIGNED-SDAG-NEXT:    v_mov_b32_e32 v1, s1
+; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(4)
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v5 offset:12
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v2
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v4 offset:4
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(5)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v6 offset:8
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v3 offset:2
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v5 offset:6
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v7 offset:8
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v3 offset:2
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v6 offset:6
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v7 offset:10
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v8 offset:10
 ; ALIGNED-SDAG-NEXT:    s_waitcnt lgkmcnt(7)
-; ALIGNED-SDAG-NEXT:    ds_write_b16 v8, v0 offset:14
+; ALIGNED-SDAG-NEXT:    ds_write_b16 v1, v0 offset:14
 ; ALIGNED-SDAG-NEXT:    s_endpgm
 ;
 ; ALIGNED-GISEL-LABEL: ds16align2:
diff --git a/llvm/test/CodeGen/AMDGPU/ds_read2.ll b/llvm/test/CodeGen/AMDGPU/ds_read2.ll
index 06c30dfd36033..d95f528442efd 100644
--- a/llvm/test/CodeGen/AMDGPU/ds_read2.ll
+++ b/llvm/test/CodeGen/AMDGPU/ds_read2.ll
@@ -522,36 +522,39 @@ define amdgpu_kernel void @simple_read2_f32_volatile_1(ptr addrspace(1) %out) #0
 define amdgpu_kernel void @unaligned_read2_f32(ptr addrspace(1) %out, ptr addrspace(3) %lds) #0 {
 ; CI-LABEL: unaligned_read2_f32:
 ; CI:       ; %bb.0:
-; CI-NEXT:    s_load_dword s0, s[4:5], 0x2
+; CI-NEXT:    s_load_dword s2, s[4:5], 0x2
+; CI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
 ; CI-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; CI-NEXT:    s_mov_b32 m0, -1
 ; CI-NEXT:    s_mov_b32 s3, 0xf000
-; CI-NEXT:    s_mov_b32 s2, 0
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    v_add_i32_e32 v1, vcc, s0, v0
-; CI-NEXT:    ds_read_u8 v2, v1 offset:34
-; CI-NEXT:    ds_read_u8 v3, v1 offset:32
-; CI-NEXT:    ds_read_u8 v4, v1 offset:3
+; CI-NEXT:    v_add_i32_e32 v1, vcc, s2, v0
+; CI-NEXT:    ds_read_u8 v2, v1 offset:1
+; CI-NEXT:    ds_read_u8 v3, v1 offset:34
+; CI-NEXT:    ds_read_u8 v4, v1 offset:32
 ; CI-NEXT:    ds_read_u8 v5, v1 offset:2
-; CI-NEXT:    ds_read_u8 v6, v1 offset:1
-; CI-NEXT:    ds_read_u8 v7, v1
+; CI-NEXT:    ds_read_u8 v6, v1
+; CI-NEXT:    ds_read_u8 v7, v1 offset:3
 ; CI-NEXT:    ds_read_u8 v8, v1 offset:33
 ; CI-NEXT:    ds_read_u8 v1, v1 offset:35
-; CI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; CI-NEXT:    s_waitcnt lgkmcnt(7)
+; CI-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
+; CI-NEXT:    s_waitcnt lgkmcnt(3)
+; CI-NEXT:    v_or_b32_e32 v2, v2, v6
+; CI-NEXT:    s_waitcnt lgkmcnt(2)
+; CI-NEXT:    v_lshlrev_b32_e32 v6, 8, v7
+; CI-NEXT:    v_or_b32_e32 v5, v6, v5
+; CI-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; CI-NEXT:    v_lshlrev_b32_e32 v6, 8, v6
-; CI-NEXT:    v_or_b32_e32 v4, v4, v5
 ; CI-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; CI-NEXT:    v_or_b32_e32 v2, v5, v2
 ; CI-NEXT:    v_lshlrev_b32_e32 v5, 8, v8
-; CI-NEXT:    v_or_b32_e32 v1, v1, v2
-; CI-NEXT:    v_or_b32_e32 v6, v6, v7
-; CI-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; CI-NEXT:    v_or_b32_e32 v3, v5, v3
-; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; CI-NEXT:    v_or_b32_e32 v4, v4, v6
 ; CI-NEXT:    v_or_b32_e32 v1, v1, v3
-; CI-NEXT:    v_add_f32_e32 v2, v4, v1
+; CI-NEXT:    v_or_b32_e32 v4, v5, v4
+; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; CI-NEXT:    v_or_b32_e32 v1, v1, v4
+; CI-NEXT:    v_add_f32_e32 v2, v2, v1
+; CI-NEXT:    s_mov_b32 s2, 0
 ; CI-NEXT:    v_mov_b32_e32 v1, 0
 ; CI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; CI-NEXT:    s_endpgm
@@ -612,36 +615,39 @@ define amdgpu_kernel void @unaligned_read2_f32(ptr addrspace(1) %out, ptr addrsp
 define amdgpu_kernel void @unaligned_offset_read2_f32(ptr addrspace(1) %out, ptr addrspace(3) %lds) #0 {
 ; CI-LABEL: unaligned_offset_read2_f32:
 ; CI:       ; %bb.0:
-; CI-NEXT:    s_load_dword s0, s[4:5], 0x2
+; CI-NEXT:    s_load_dword s2, s[4:5], 0x2
+; CI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
 ; CI-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; CI-NEXT:    s_mov_b32 m0, -1
 ; CI-NEXT:    s_mov_b32 s3, 0xf000
-; CI-NEXT:    s_mov_b32 s2, 0
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    v_add_i32_e32 v1, vcc, s0, v0
-; CI-NEXT:    ds_read_u8 v2, v1 offset:11
-; CI-NEXT:    ds_read_u8 v3, v1 offset:9
-; CI-NEXT:    ds_read_u8 v4, v1 offset:8
+; CI-NEXT:    v_add_i32_e32 v1, vcc, s2, v0
+; CI-NEXT:    ds_read_u8 v2, v1 offset:6
+; CI-NEXT:    ds_read_u8 v3, v1 offset:11
+; CI-NEXT:    ds_read_u8 v4, v1 offset:9
 ; CI-NEXT:    ds_read_u8 v5, v1 offset:7
-; CI-NEXT:    ds_read_u8 v6, v1 offset:6
-; CI-NEXT:    ds_read_u8 v7, v1 offset:5
+; CI-NEXT:    ds_read_u8 v6, v1 offset:5
+; CI-NEXT:    ds_read_u8 v7, v1 offset:8
 ; CI-NEXT:    ds_read_u8 v8, v1 offset:10
 ; CI-NEXT:    ds_read_u8 v1, v1 offset:12
-; CI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; CI-NEXT:    s_waitcnt lgkmcnt(7)
+; CI-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
+; CI-NEXT:    s_waitcnt lgkmcnt(3)
+; CI-NEXT:    v_or_b32_e32 v2, v2, v6
+; CI-NEXT:    s_waitcnt lgkmcnt(2)
+; CI-NEXT:    v_lshlrev_b32_e32 v6, 8, v7
+; CI-NEXT:    v_or_b32_e32 v5, v6, v5
+; CI-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
-; CI-NEXT:    v_lshlrev_b32_e32 v6, 8, v6
-; CI-NEXT:    v_or_b32_e32 v4, v4, v5
 ; CI-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; CI-NEXT:    v_or_b32_e32 v2, v5, v2
 ; CI-NEXT:    v_lshlrev_b32_e32 v5, 8, v8
-; CI-NEXT:    v_or_b32_e32 v1, v1, v2
-; CI-NEXT:    v_or_b32_e32 v6, v6, v7
-; CI-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; CI-NEXT:    v_or_b32_e32 v3, v5, v3
-; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; CI-NEXT:    v_or_b32_e32 v4, v4, v6
 ; CI-NEXT:    v_or_b32_e32 v1, v1, v3
-; CI-NEXT:    v_add_f32_e32 v2, v4, v1
+; CI-NEXT:    v_or_b32_e32 v4, v5, v4
+; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; CI-NEXT:    v_or_b32_e32 v1, v1, v4
+; CI-NEXT:    v_add_f32_e32 v2, v2, v1
+; CI-NEXT:    s_mov_b32 s2, 0
 ; CI-NEXT:    v_mov_b32_e32 v1, 0
 ; CI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; CI-NEXT:    s_endpgm
@@ -709,17 +715,17 @@ define amdgpu_kernel void @misaligned_2_simple_read2_f32(ptr addrspace(1) %out,
 ; CI-NEXT:    s_mov_b32 s2, 0
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    v_add_i32_e32 v1, vcc, s0, v0
-; CI-NEXT:    ds_read_u16 v2, v1 offset:32
-; CI-NEXT:    ds_read_u16 v3, v1 offset:2
+; CI-NEXT:    ds_read_u16 v2, v1 offset:2
+; CI-NEXT:    ds_read_u16 v3, v1 offset:32
 ; CI-NEXT:    ds_read_u16 v4, v1
 ; CI-NEXT:    ds_read_u16 v1, v1 offset:34
 ; CI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; CI-NEXT:    v_or_b32_e32 v3, v3, v4
+; CI-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; CI-NEXT:    v_or_b32_e32 v2, v2, v4
 ; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; CI-NEXT:    v_or_b32_e32 v1, v1, v2
-; CI-NEXT:    v_add_f32_e32 v2, v3, v1
+; CI-NEXT:    v_or_b32_e32 v1, v1, v3
+; CI-NEXT:    v_add_f32_e32 v2, v2, v1
 ; CI-NEXT:    v_mov_b32_e32 v1, 0
 ; CI-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; CI-NEXT:    s_endpgm
@@ -1460,17 +1466,17 @@ define amdgpu_kernel void @read2_v2i32_align1_odd_offset(ptr addrspace(1) %out)
 ; CI-NEXT:    v_or_b32_e32 v2, v2, v3
 ; CI-NEXT:    s_waitcnt lgkmcnt(1)
 ; CI-NEXT:    v_or_b32_e32 v1, v1, v4
-; CI-NEXT:    ds_read_u8 v4, v0 offset:67
-; CI-NEXT:    ds_read_u8 v6, v0 offset:66
+; CI-NEXT:    ds_read_u8 v4, v0 offset:66
+; CI-NEXT:    ds_read_u8 v6, v0 offset:67
 ; CI-NEXT:    ds_read_u8 v0, v0 offset:65
 ; CI-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; CI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
 ; CI-NEXT:    v_or_b32_e32 v1, v2, v1
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    v_lshlrev_b32_e32 v2, 8, v6
+; CI-NEXT:    v_lshlrev_b32_e32 v2, 8, v4
 ; CI-NEXT:    v_or_b32_e32 v0, v2, v0
 ; CI-NEXT:    v_lshlrev_b32_e32 v2, 8, v5
-; CI-NEXT:    v_or_b32_e32 v2, v2, v4
+; CI-NEXT:    v_or_b32_e32 v2, v2, v6
 ; CI-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; CI-NEXT:    s_mov_b32 s3, 0xf000
 ; CI-NEXT:    s_mov_b32 s2, -1
@@ -1481,26 +1487,25 @@ define amdgpu_kernel void @read2_v2i32_align1_odd_offset(ptr addrspace(1) %out)
 ; GFX9-ALIGNED-LABEL: read2_v2i32_align1_odd_offset:
 ; GFX9-ALIGNED:       ; %bb.0: ; %entry
 ; GFX9-ALIGNED-NEXT:    v_mov_b32_e32 v2, 0
-; GFX9-ALIGNED-NEXT:    ds_read_u8 v0, v2 offset:70
-; GFX9-ALIGNED-NEXT:    ds_read_u8 v3, v2 offset:65
-; GFX9-ALIGNED-NEXT:    ds_read_u8 v4, v2 offset:66
-; GFX9-ALIGNED-NEXT:    ds_read_u8 v5, v2 offset:67
-; GFX9-ALIGNED-NEXT:    ds_read_u8 v6, v2 offset:68
-; GFX9-ALIGNED-NEXT:    ds_read_u8 v1, v2 offset:69
+; GFX9-ALIGNED-NEXT:    ds_read_u8 v0, v2 offset:65
+; GFX9-ALIGNED-NEXT:    ds_read_u8 v3, v2 offset:66
+; GFX9-ALIGNED-NEXT:    ds_read_u8 v4, v2 offset:67
+; GFX9-ALIGNED-NEXT:    ds_read_u8 v5, v2 offset:68
+; GFX9-ALIGNED-NEXT:    ds_read_u8 v1, v2 offset:70
+; GFX9-ALIGNED-NEXT:    ds_read_u8 v6, v2 offset:69
 ; GFX9-ALIGNED-NEXT:    ds_read_u8 v7, v2 offset:72
 ; GFX9-ALIGNED-NEXT:    ds_read_u8 v8, v2 offset:71
-; GFX9-ALIGNED-NEXT:    s_waitcnt lgkmcnt(7)
-; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v0, 8, v0
 ; GFX9-ALIGNED-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
 ; GFX9-ALIGNED-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v0, v0, v1
-; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v1, 8, v7
-; GFX9-ALIGNED-NEXT:    v_or_b32_sdwa v1, v1, v8 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v1, v1, v0
-; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v0, 8, v4
-; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v0, v0, v3
-; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 8, v6
-; GFX9-ALIGNED-NEXT:    v_or_b32_sdwa v3, v3, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
+; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v1, v1, v6
+; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v6, 8, v7
+; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v0, v3, v0
+; GFX9-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 8, v5
+; GFX9-ALIGNED-NEXT:    v_or_b32_sdwa v6, v6, v8 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; GFX9-ALIGNED-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v1, v6, v1
 ; GFX9-ALIGNED-NEXT:    v_or_b32_e32 v0, v3, v0
 ; GFX9-ALIGNED-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX9-ALIGNED-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll b/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
index 60334e46a4454..52bcaed7ec75a 100644
--- a/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
+++ b/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
@@ -153,14 +153,14 @@ define i32 @global_load_2xi16_align1(ptr addrspace(1) %p) #0 {
 ; GFX7-ALIGNED-NEXT:    v_addc_u32_e32 v5, vcc, 0, v1, vcc
 ; GFX7-ALIGNED-NEXT:    v_add_i32_e32 v6, vcc, 3, v0
 ; GFX7-ALIGNED-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
-; GFX7-ALIGNED-NEXT:    flat_load_ubyte v6, v[6:7]
 ; GFX7-ALIGNED-NEXT:    flat_load_ubyte v4, v[4:5]
+; GFX7-ALIGNED-NEXT:    flat_load_ubyte v5, v[6:7]
 ; GFX7-ALIGNED-NEXT:    flat_load_ubyte v2, v[2:3]
 ; GFX7-ALIGNED-NEXT:    flat_load_ubyte v0, v[0:1]
 ; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(3)
-; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 24, v6
-; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(2)
 ; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v1, 8, v4
+; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(2)
+; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 24, v5
 ; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(1)
 ; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll b/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
index f9694dcd89abf..6f8da57e223e5 100644
--- a/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
+++ b/llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
@@ -204,14 +204,14 @@ define i32 @private_load_2xi16_align1(ptr addrspace(5) %p) #0 {
 ; GFX7-ALIGNED-NEXT:    v_add_i32_e32 v1, vcc, 2, v0
 ; GFX7-ALIGNED-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; GFX7-ALIGNED-NEXT:    v_add_i32_e32 v3, vcc, 3, v0
-; GFX7-ALIGNED-NEXT:    buffer_load_ubyte v3, v3, s[0:3], 0 offen
 ; GFX7-ALIGNED-NEXT:    buffer_load_ubyte v2, v2, s[0:3], 0 offen
+; GFX7-ALIGNED-NEXT:    buffer_load_ubyte v3, v3, s[0:3], 0 offen
 ; GFX7-ALIGNED-NEXT:    buffer_load_ubyte v1, v1, s[0:3], 0 offen
 ; GFX7-ALIGNED-NEXT:    buffer_load_ubyte v0, v0, s[0:3], 0 offen
 ; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(3)
-; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
-; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(2)
 ; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
+; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(2)
+; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v3, 24, v3
 ; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(1)
 ; GFX7-ALIGNED-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX7-ALIGNED-NEXT:    s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll b/llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll
index b4b9c2d3e0135..13c7538475421 100644
--- a/llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll
@@ -1929,25 +1929,25 @@ define amdgpu_kernel void @s_copysign_out_f16_mag_f64_sign_f16(ptr addrspace(1)
 ; GFX11-NEXT:    v_med3_i32 v1, s3, 0, 13
 ; GFX11-NEXT:    v_readfirstlane_b32 s3, v0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, s4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_readfirstlane_b32 s6, v1
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_or_b32 s3, s5, s3
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    s_or_b32 s5, s3, 0x1000
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_lshr_b32 s7, s5, s6
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_lshl_b32 s6, s7, s6
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_cmp_lg_u32 s6, s5
 ; GFX11-NEXT:    s_cselect_b32 s5, 1, 0
 ; GFX11-NEXT:    s_addk_i32 s2, 0xfc10
 ; GFX11-NEXT:    s_or_b32 s5, s7, s5
 ; GFX11-NEXT:    s_lshl_b32 s6, s2, 12
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_or_b32 s6, s3, s6
 ; GFX11-NEXT:    s_cmp_lt_i32 s2, 1
 ; GFX11-NEXT:    s_cselect_b32 s5, s5, s6
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_and_b32 s6, s5, 7
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_cmp_gt_i32 s6, 5
 ; GFX11-NEXT:    s_cselect_b32 s7, 1, 0
 ; GFX11-NEXT:    s_cmp_eq_u32 s6, 3
@@ -2175,16 +2175,18 @@ define amdgpu_kernel void @s_copysign_v3f16(ptr addrspace(1) %arg_out, <3 x half
 ; GFX11-FAKE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x2c
 ; GFX11-FAKE16-NEXT:    s_load_b64 s[4:5], s[4:5], 0x24
 ; GFX11-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v3, 0 :: v_dual_mov_b32 v0, s2
+; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v3, 0 :: v_dual_mov_b32 v2, s3
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, s2
 ; GFX11-FAKE16-NEXT:    s_lshr_b32 s2, s2, 16
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v2, s3 :: v_dual_mov_b32 v1, s2
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v1, s2
+; GFX11-FAKE16-NEXT:    v_bfi_b32 v2, 0x7fff, s1, v2
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_bfi_b32 v0, 0x7fff, s0, v0
 ; GFX11-FAKE16-NEXT:    s_lshr_b32 s0, s0, 16
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-FAKE16-NEXT:    v_bfi_b32 v2, 0x7fff, s1, v2
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instid1(SALU_CYCLE_1)
 ; GFX11-FAKE16-NEXT:    v_bfi_b32 v1, 0x7fff, s0, v1
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GFX11-FAKE16-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; GFX11-FAKE16-NEXT:    s_clause 0x1
diff --git a/llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll b/llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll
index fab45c9dc3bc3..61f5b73033f5e 100644
--- a/llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll
+++ b/llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll
@@ -569,14 +569,13 @@ define amdgpu_kernel void @s_test_copysign_v3f32(ptr addrspace(1) %out, <3 x flo
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    s_load_b256 s[8:15], s[4:5], 0x34
 ; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX11-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v3, s12
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s14 :: v_dual_mov_b32 v1, s13
-; GFX11-NEXT:    v_mov_b32_e32 v3, s12
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-NEXT:    v_bfi_b32 v2, 0x7fffffff, s10, v0
 ; GFX11-NEXT:    v_bfi_b32 v1, 0x7fffffff, s9, v1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-NEXT:    v_bfi_b32 v0, 0x7fffffff, s8, v3
 ; GFX11-NEXT:    global_store_b96 v4, v[0:2], s[0:1]
 ; GFX11-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/fdiv.ll b/llvm/test/CodeGen/AMDGPU/fdiv.ll
index 33910947e6fac..b826e6c469d8e 100644
--- a/llvm/test/CodeGen/AMDGPU/fdiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/fdiv.ll
@@ -1164,8 +1164,8 @@ define amdgpu_kernel void @s_fdiv_v2f32(ptr addrspace(1) %out, <2 x float> %a, <
 ; GFX11-NEXT:    v_fma_f32 v5, -v2, v4, v0
 ; GFX11-NEXT:    v_fmac_f32_e32 v4, v5, v3
 ; GFX11-NEXT:    v_fma_f32 v0, -v2, v4, v0
-; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    s_denorm_mode 12
+; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    v_div_fmas_f32 v0, v0, v3, v4
 ; GFX11-NEXT:    v_div_fixup_f32 v0, v0, s2, s0
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
diff --git a/llvm/test/CodeGen/AMDGPU/fmed3.ll b/llvm/test/CodeGen/AMDGPU/fmed3.ll
index db0c5362bdc5f..c583b5b4d3e9a 100644
--- a/llvm/test/CodeGen/AMDGPU/fmed3.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmed3.ll
@@ -8300,10 +8300,10 @@ define amdgpu_kernel void @two_non_inline_constant_multi_use(ptr addrspace(1) %o
 ; GFX11-SDAG-NEXT:    global_load_b32 v1, v0, s[2:3]
 ; GFX11-SDAG-NEXT:    s_mov_b32 s2, 0x41000000
 ; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT:    v_add_f32_e32 v3, 0x41800000, v1
 ; GFX11-SDAG-NEXT:    v_add_f32_e32 v2, 0.5, v1
+; GFX11-SDAG-NEXT:    v_add_f32_e32 v3, 0x41800000, v1
 ; GFX11-SDAG-NEXT:    v_add_f32_e32 v1, 0x41000000, v1
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-SDAG-NEXT:    v_med3_f32 v2, v2, s2, 0x41800000
 ; GFX11-SDAG-NEXT:    global_store_b32 v0, v2, s[0:1]
 ; GFX11-SDAG-NEXT:    global_store_b32 v[0:1], v3, off dlc
diff --git a/llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll b/llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll
index ff894d184e6c4..dfb8193b9532a 100644
--- a/llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll
+++ b/llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll
@@ -499,7 +499,7 @@ define double @fneg_xor_select_i64_user_with_srcmods(i1 %cond, i64 %arg0, i64 %a
 ; GFX11-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v0
-; GFX11-NEXT:    v_dual_cndmask_b32 v1, v3, v1 :: v_dual_cndmask_b32 v2, v4, v2
+; GFX11-NEXT:    v_dual_cndmask_b32 v2, v4, v2 :: v_dual_cndmask_b32 v1, v3, v1
 ; GFX11-NEXT:    v_add_f64 v[0:1], -v[1:2], 2.0
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %select = select i1 %cond, i64 %arg0, i64 %arg1
diff --git a/llvm/test/CodeGen/AMDGPU/fp-classify.ll b/llvm/test/CodeGen/AMDGPU/fp-classify.ll
index 498df8a65feda..200f74beec385 100644
--- a/llvm/test/CodeGen/AMDGPU/fp-classify.ll
+++ b/llvm/test/CodeGen/AMDGPU/fp-classify.ll
@@ -536,8 +536,8 @@ define amdgpu_kernel void @test_isfinite_pattern_4_commute_and(ptr addrspace(1)
 define amdgpu_kernel void @test_not_isfinite_pattern_4_wrong_ord_test(ptr addrspace(1) nocapture %out, float %x, [8 x i32], float %y) #0 {
 ; SI-LABEL: test_not_isfinite_pattern_4_wrong_ord_test:
 ; SI:       ; %bb.0:
-; SI-NEXT:    s_load_dword s0, s[4:5], 0x14
 ; SI-NEXT:    s_load_dwordx2 s[8:9], s[4:5], 0x9
+; SI-NEXT:    s_load_dword s0, s[4:5], 0x14
 ; SI-NEXT:    s_load_dword s1, s[4:5], 0xb
 ; SI-NEXT:    s_mov_b32 s11, 0xf000
 ; SI-NEXT:    s_mov_b32 s10, -1
diff --git a/llvm/test/CodeGen/AMDGPU/freeze.ll b/llvm/test/CodeGen/AMDGPU/freeze.ll
index ff9b0641e43d8..ac438062ae208 100644
--- a/llvm/test/CodeGen/AMDGPU/freeze.ll
+++ b/llvm/test/CodeGen/AMDGPU/freeze.ll
@@ -2031,9 +2031,9 @@ define void @freeze_v15i32(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb) {
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v8, vcc, 16, v0
 ; GFX8-GISEL-NEXT:    flat_load_dwordx4 v[4:7], v[0:1]
 ; GFX8-GISEL-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
-; GFX8-GISEL-NEXT:    flat_load_dwordx4 v[8:11], v[8:9]
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v12, vcc, 32, v0
 ; GFX8-GISEL-NEXT:    v_addc_u32_e32 v13, vcc, 0, v1, vcc
+; GFX8-GISEL-NEXT:    flat_load_dwordx4 v[8:11], v[8:9]
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v0, vcc, 48, v0
 ; GFX8-GISEL-NEXT:    flat_load_dwordx4 v[12:15], v[12:13]
 ; GFX8-GISEL-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
@@ -10417,9 +10417,9 @@ define void @freeze_v8p3(ptr addrspace(3) %ptra, ptr addrspace(3) %ptrb) {
 ; GFX6-GISEL-LABEL: freeze_v8p3:
 ; GFX6-GISEL:       ; %bb.0:
 ; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX6-GISEL-NEXT:    s_mov_b32 m0, -1
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v4, vcc, 8, v0
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 16, v0
-; GFX6-GISEL-NEXT:    s_mov_b32 m0, -1
 ; GFX6-GISEL-NEXT:    ds_read_b64 v[2:3], v0
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v0, vcc, 24, v0
 ; GFX6-GISEL-NEXT:    ds_read_b64 v[4:5], v4
@@ -10546,14 +10546,14 @@ define void @freeze_v16p3(ptr addrspace(3) %ptra, ptr addrspace(3) %ptrb) {
 ; GFX6-SDAG-LABEL: freeze_v16p3:
 ; GFX6-SDAG:       ; %bb.0:
 ; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v6, vcc, 8, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v2, vcc, 24, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v4, vcc, 16, v0
-; GFX6-SDAG-NEXT:    s_mov_b32 m0, -1
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v12, vcc, 40, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v14, vcc, 32, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v2, vcc, 8, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v4, vcc, 24, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v6, vcc, 16, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v10, vcc, 40, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v12, vcc, 32, v0
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v16, vcc, 56, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v10, vcc, 48, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v14, vcc, 48, v0
+; GFX6-SDAG-NEXT:    s_mov_b32 m0, -1
 ; GFX6-SDAG-NEXT:    ds_read_b64 v[2:3], v2
 ; GFX6-SDAG-NEXT:    ds_read_b64 v[4:5], v4
 ; GFX6-SDAG-NEXT:    ds_read_b64 v[6:7], v6
@@ -10563,22 +10563,23 @@ define void @freeze_v16p3(ptr addrspace(3) %ptra, ptr addrspace(3) %ptrb) {
 ; GFX6-SDAG-NEXT:    ds_read_b64 v[14:15], v14
 ; GFX6-SDAG-NEXT:    ds_read_b64 v[16:17], v16
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 48, v1
-; GFX6-SDAG-NEXT:    s_waitcnt lgkmcnt(3)
-; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[10:11]
+; GFX6-SDAG-NEXT:    s_waitcnt lgkmcnt(4)
+; GFX6-SDAG-NEXT:    ds_write_b64 v1, v[8:9]
+; GFX6-SDAG-NEXT:    s_waitcnt lgkmcnt(2)
+; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[14:15]
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 56, v1
-; GFX6-SDAG-NEXT:    s_waitcnt lgkmcnt(1)
+; GFX6-SDAG-NEXT:    s_waitcnt lgkmcnt(2)
 ; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[16:17]
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 32, v1
-; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[14:15]
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 40, v1
 ; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[12:13]
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 40, v1
+; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[10:11]
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 16, v1
-; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[4:5]
+; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[6:7]
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 24, v1
-; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[2:3]
+; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[4:5]
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 8, v1
-; GFX6-SDAG-NEXT:    ds_write_b64 v1, v[8:9]
-; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[6:7]
+; GFX6-SDAG-NEXT:    ds_write_b64 v0, v[2:3]
 ; GFX6-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX6-SDAG-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -11565,22 +11566,22 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v9, vcc, 44, v0
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v10, vcc, 40, v0
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v11, vcc, 36, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v12, vcc, 28, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v13, vcc, 24, v0
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v14, vcc, 20, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v12, vcc, 32, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v13, vcc, 28, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v14, vcc, 24, v0
+; GFX6-SDAG-NEXT:    v_add_i32_e32 v15, vcc, 20, v0
 ; GFX6-SDAG-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v4, v4, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v9, v9, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v10, v10, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v11, v11, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    buffer_load_dword v15, v0, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    buffer_load_dword v13, v13, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    buffer_load_dword v16, v0, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    buffer_load_dword v15, v15, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v14, v14, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    v_add_i32_e32 v16, vcc, 32, v0
+; GFX6-SDAG-NEXT:    buffer_load_dword v13, v13, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 60, v0
-; GFX6-SDAG-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v17, vcc, 4, v1
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v18, vcc, 8, v1
@@ -11603,13 +11604,15 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v19, vcc, 36, v1
 ; GFX6-SDAG-NEXT:    v_add_i32_e32 v8, vcc, 44, v1
 ; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(9)
-; GFX6-SDAG-NEXT:    buffer_store_dword v15, v1, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(7)
-; GFX6-SDAG-NEXT:    buffer_store_dword v14, v17, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    buffer_store_dword v13, v7, s[0:3], 0 offen
-; GFX6-SDAG-NEXT:    buffer_store_dword v12, v18, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    buffer_store_dword v16, v1, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(9)
+; GFX6-SDAG-NEXT:    buffer_store_dword v15, v17, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(9)
+; GFX6-SDAG-NEXT:    buffer_store_dword v14, v7, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(9)
-; GFX6-SDAG-NEXT:    buffer_store_dword v16, v6, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    buffer_store_dword v13, v18, s[0:3], 0 offen
+; GFX6-SDAG-NEXT:    s_waitcnt vmcnt(9)
+; GFX6-SDAG-NEXT:    buffer_store_dword v12, v6, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_store_dword v11, v19, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_store_dword v10, v5, s[0:3], 0 offen
 ; GFX6-SDAG-NEXT:    buffer_store_dword v9, v8, s[0:3], 0 offen
@@ -11631,24 +11634,24 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX6-GISEL:       ; %bb.0:
 ; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 4, v0
-; GFX6-GISEL-NEXT:    buffer_load_dword v4, v0, s[0:3], 0 offen
-; GFX6-GISEL-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v3, vcc, 8, v0
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 12, v0
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 16, v0
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v7, vcc, 20, v0
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v8, vcc, 24, v0
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v9, vcc, 28, v0
+; GFX6-GISEL-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v4, vcc, 12, v0
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 16, v0
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 20, v0
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v7, vcc, 24, v0
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v8, vcc, 28, v0
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v10, vcc, 32, v0
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v11, vcc, 36, v0
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v12, vcc, 40, v0
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v13, vcc, 44, v0
-; GFX6-GISEL-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_load_dword v4, v4, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_load_dword v9, v0, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v5, v5, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v6, v6, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v7, v7, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v8, v8, s[0:3], 0 offen
-; GFX6-GISEL-NEXT:    buffer_load_dword v9, v9, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v10, v10, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v11, v11, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
@@ -11658,8 +11661,8 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v15, vcc, 52, v0
 ; GFX6-GISEL-NEXT:    buffer_load_dword v15, v15, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v16, vcc, 56, v0
-; GFX6-GISEL-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v0, vcc, 60, v0
+; GFX6-GISEL-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v17, vcc, 4, v1
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v18, vcc, 8, v1
@@ -11669,30 +11672,32 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX6-GISEL-NEXT:    s_waitcnt expcnt(0)
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 16, v1
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v17, vcc, 20, v1
-; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(12)
-; GFX6-GISEL-NEXT:    buffer_store_dword v6, v2, s[0:3], 0 offen
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 40, v1
 ; GFX6-GISEL-NEXT:    buffer_store_dword v3, v18, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    s_waitcnt expcnt(0)
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v3, vcc, 24, v1
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v18, vcc, 28, v1
-; GFX6-GISEL-NEXT:    buffer_store_dword v5, v19, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(14)
+; GFX6-GISEL-NEXT:    buffer_store_dword v4, v19, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    s_waitcnt expcnt(0)
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 32, v1
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v4, vcc, 32, v1
+; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(14)
+; GFX6-GISEL-NEXT:    buffer_store_dword v5, v2, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 40, v1
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v19, vcc, 36, v1
-; GFX6-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 44, v1
-; GFX6-GISEL-NEXT:    buffer_store_dword v4, v1, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    s_waitcnt expcnt(0)
+; GFX6-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 44, v1
+; GFX6-GISEL-NEXT:    buffer_store_dword v9, v1, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX6-GISEL-NEXT:    buffer_store_dword v7, v17, s[0:3], 0 offen
-; GFX6-GISEL-NEXT:    buffer_store_dword v8, v3, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_store_dword v6, v17, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_store_dword v7, v3, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX6-GISEL-NEXT:    buffer_store_dword v9, v18, s[0:3], 0 offen
-; GFX6-GISEL-NEXT:    buffer_store_dword v10, v5, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_store_dword v8, v18, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_store_dword v10, v4, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(14)
 ; GFX6-GISEL-NEXT:    buffer_store_dword v11, v19, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    buffer_store_dword v12, v2, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX6-GISEL-NEXT:    buffer_store_dword v13, v6, s[0:3], 0 offen
+; GFX6-GISEL-NEXT:    buffer_store_dword v13, v5, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 48, v1
 ; GFX6-GISEL-NEXT:    buffer_store_dword v14, v2, s[0:3], 0 offen
 ; GFX6-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 52, v1
@@ -11723,22 +11728,22 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v9, vcc, 44, v0
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v10, vcc, 40, v0
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v11, vcc, 36, v0
-; GFX7-SDAG-NEXT:    v_add_i32_e32 v12, vcc, 28, v0
-; GFX7-SDAG-NEXT:    v_add_i32_e32 v13, vcc, 24, v0
-; GFX7-SDAG-NEXT:    v_add_i32_e32 v14, vcc, 20, v0
+; GFX7-SDAG-NEXT:    v_add_i32_e32 v12, vcc, 32, v0
+; GFX7-SDAG-NEXT:    v_add_i32_e32 v13, vcc, 28, v0
+; GFX7-SDAG-NEXT:    v_add_i32_e32 v14, vcc, 24, v0
+; GFX7-SDAG-NEXT:    v_add_i32_e32 v15, vcc, 20, v0
 ; GFX7-SDAG-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v4, v4, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v9, v9, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v10, v10, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v11, v11, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    buffer_load_dword v15, v0, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    buffer_load_dword v13, v13, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    buffer_load_dword v16, v0, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    buffer_load_dword v15, v15, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v14, v14, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    v_add_i32_e32 v16, vcc, 32, v0
+; GFX7-SDAG-NEXT:    buffer_load_dword v13, v13, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v0, vcc, 60, v0
-; GFX7-SDAG-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v17, vcc, 4, v1
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v18, vcc, 8, v1
@@ -11759,13 +11764,15 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v19, vcc, 36, v1
 ; GFX7-SDAG-NEXT:    v_add_i32_e32 v8, vcc, 44, v1
 ; GFX7-SDAG-NEXT:    s_waitcnt vmcnt(9)
-; GFX7-SDAG-NEXT:    buffer_store_dword v15, v1, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    s_waitcnt vmcnt(7)
-; GFX7-SDAG-NEXT:    buffer_store_dword v14, v17, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    buffer_store_dword v13, v7, s[0:3], 0 offen
-; GFX7-SDAG-NEXT:    buffer_store_dword v12, v18, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    buffer_store_dword v16, v1, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    s_waitcnt vmcnt(9)
+; GFX7-SDAG-NEXT:    buffer_store_dword v15, v17, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    s_waitcnt vmcnt(9)
+; GFX7-SDAG-NEXT:    buffer_store_dword v14, v7, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    s_waitcnt vmcnt(9)
-; GFX7-SDAG-NEXT:    buffer_store_dword v16, v6, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    buffer_store_dword v13, v18, s[0:3], 0 offen
+; GFX7-SDAG-NEXT:    s_waitcnt vmcnt(9)
+; GFX7-SDAG-NEXT:    buffer_store_dword v12, v6, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_store_dword v11, v19, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_store_dword v10, v5, s[0:3], 0 offen
 ; GFX7-SDAG-NEXT:    buffer_store_dword v9, v8, s[0:3], 0 offen
@@ -11785,24 +11792,24 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX7-GISEL:       ; %bb.0:
 ; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 4, v0
-; GFX7-GISEL-NEXT:    buffer_load_dword v4, v0, s[0:3], 0 offen
-; GFX7-GISEL-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v3, vcc, 8, v0
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 12, v0
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 16, v0
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v7, vcc, 20, v0
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v8, vcc, 24, v0
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v9, vcc, 28, v0
+; GFX7-GISEL-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v4, vcc, 12, v0
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 16, v0
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 20, v0
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v7, vcc, 24, v0
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v8, vcc, 28, v0
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v10, vcc, 32, v0
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v11, vcc, 36, v0
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v12, vcc, 40, v0
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v13, vcc, 44, v0
-; GFX7-GISEL-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_load_dword v4, v4, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_load_dword v9, v0, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v5, v5, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v6, v6, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v7, v7, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v8, v8, s[0:3], 0 offen
-; GFX7-GISEL-NEXT:    buffer_load_dword v9, v9, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v10, v10, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v11, v11, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
@@ -11812,8 +11819,8 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v15, vcc, 52, v0
 ; GFX7-GISEL-NEXT:    buffer_load_dword v15, v15, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v16, vcc, 56, v0
-; GFX7-GISEL-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v0, vcc, 60, v0
+; GFX7-GISEL-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v17, vcc, 4, v1
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v18, vcc, 8, v1
@@ -11822,28 +11829,29 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX7-GISEL-NEXT:    buffer_store_dword v2, v17, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 16, v1
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v17, vcc, 20, v1
-; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(12)
-; GFX7-GISEL-NEXT:    buffer_store_dword v6, v2, s[0:3], 0 offen
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 40, v1
 ; GFX7-GISEL-NEXT:    buffer_store_dword v3, v18, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v3, vcc, 24, v1
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v18, vcc, 28, v1
-; GFX7-GISEL-NEXT:    buffer_store_dword v5, v19, s[0:3], 0 offen
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 32, v1
+; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(14)
+; GFX7-GISEL-NEXT:    buffer_store_dword v4, v19, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v4, vcc, 32, v1
+; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(14)
+; GFX7-GISEL-NEXT:    buffer_store_dword v5, v2, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 40, v1
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v19, vcc, 36, v1
-; GFX7-GISEL-NEXT:    v_add_i32_e32 v6, vcc, 44, v1
-; GFX7-GISEL-NEXT:    buffer_store_dword v4, v1, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    v_add_i32_e32 v5, vcc, 44, v1
+; GFX7-GISEL-NEXT:    buffer_store_dword v9, v1, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX7-GISEL-NEXT:    buffer_store_dword v7, v17, s[0:3], 0 offen
-; GFX7-GISEL-NEXT:    buffer_store_dword v8, v3, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_store_dword v6, v17, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_store_dword v7, v3, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX7-GISEL-NEXT:    buffer_store_dword v9, v18, s[0:3], 0 offen
-; GFX7-GISEL-NEXT:    buffer_store_dword v10, v5, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_store_dword v8, v18, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_store_dword v10, v4, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(14)
 ; GFX7-GISEL-NEXT:    buffer_store_dword v11, v19, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    buffer_store_dword v12, v2, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX7-GISEL-NEXT:    buffer_store_dword v13, v6, s[0:3], 0 offen
+; GFX7-GISEL-NEXT:    buffer_store_dword v13, v5, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 48, v1
 ; GFX7-GISEL-NEXT:    buffer_store_dword v14, v2, s[0:3], 0 offen
 ; GFX7-GISEL-NEXT:    v_add_i32_e32 v2, vcc, 52, v1
@@ -11861,24 +11869,24 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX8-GISEL:       ; %bb.0:
 ; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v2, vcc, 4, v0
-; GFX8-GISEL-NEXT:    buffer_load_dword v4, v0, s[0:3], 0 offen
-; GFX8-GISEL-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v3, vcc, 8, v0
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v5, vcc, 12, v0
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v6, vcc, 16, v0
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v7, vcc, 20, v0
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v8, vcc, 24, v0
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v9, vcc, 28, v0
+; GFX8-GISEL-NEXT:    buffer_load_dword v2, v2, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v4, vcc, 12, v0
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v5, vcc, 16, v0
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v6, vcc, 20, v0
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v7, vcc, 24, v0
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v8, vcc, 28, v0
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v10, vcc, 32, v0
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v11, vcc, 36, v0
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v12, vcc, 40, v0
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v13, vcc, 44, v0
-; GFX8-GISEL-NEXT:    buffer_load_dword v3, v3, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_load_dword v4, v4, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_load_dword v9, v0, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v5, v5, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v6, v6, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v7, v7, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v8, v8, s[0:3], 0 offen
-; GFX8-GISEL-NEXT:    buffer_load_dword v9, v9, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v10, v10, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v11, v11, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v12, v12, s[0:3], 0 offen
@@ -11888,8 +11896,8 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v15, vcc, 52, v0
 ; GFX8-GISEL-NEXT:    buffer_load_dword v15, v15, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v16, vcc, 56, v0
-; GFX8-GISEL-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v0, vcc, 60, v0
+; GFX8-GISEL-NEXT:    buffer_load_dword v16, v16, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v17, vcc, 4, v1
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v18, vcc, 8, v1
@@ -11898,28 +11906,29 @@ define void @freeze_v16p5(ptr addrspace(5) %ptra, ptr addrspace(5) %ptrb) {
 ; GFX8-GISEL-NEXT:    buffer_store_dword v2, v17, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v2, vcc, 16, v1
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v17, vcc, 20, v1
-; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(12)
-; GFX8-GISEL-NEXT:    buffer_store_dword v6, v2, s[0:3], 0 offen
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v2, vcc, 40, v1
 ; GFX8-GISEL-NEXT:    buffer_store_dword v3, v18, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v3, vcc, 24, v1
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v18, vcc, 28, v1
-; GFX8-GISEL-NEXT:    buffer_store_dword v5, v19, s[0:3], 0 offen
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v5, vcc, 32, v1
+; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(14)
+; GFX8-GISEL-NEXT:    buffer_store_dword v4, v19, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v4, vcc, 32, v1
+; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(14)
+; GFX8-GISEL-NEXT:    buffer_store_dword v5, v2, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v2, vcc, 40, v1
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v19, vcc, 36, v1
-; GFX8-GISEL-NEXT:    v_add_u32_e32 v6, vcc, 44, v1
-; GFX8-GISEL-NEXT:    buffer_store_dword v4, v1, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    v_add_u32_e32 v5, vcc, 44, v1
+; GFX8-GISEL-NEXT:    buffer_store_dword v9, v1, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX8-GISEL-NEXT:    buffer_store_dword v7, v17, s[0:3], 0 offen
-; GFX8-GISEL-NEXT:    buffer_store_dword v8, v3, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_store_dword v6, v17, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_store_dword v7, v3, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX8-GISEL-NEXT:    buffer_store_dword v9, v18, s[0:3], 0 offen
-; GFX8-GISEL-NEXT:    buffer_store_dword v10, v5, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_store_dword v8, v18, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_store_dword v10, v4, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(14)
 ; GFX8-GISEL-NEXT:    buffer_store_dword v11, v19, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    buffer_store_dword v12, v2, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    s_waitcnt vmcnt(14)
-; GFX8-GISEL-NEXT:    buffer_store_dword v13, v6, s[0:3], 0 offen
+; GFX8-GISEL-NEXT:    buffer_store_dword v13, v5, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v2, vcc, 48, v1
 ; GFX8-GISEL-NEXT:    buffer_store_dword v14, v2, s[0:3], 0 offen
 ; GFX8-GISEL-NEXT:    v_add_u32_e32 v2, vcc, 52, v1
diff --git a/llvm/test/CodeGen/AMDGPU/function-args-inreg.ll b/llvm/test/CodeGen/AMDGPU/function-args-inreg.ll
index 0db2a1679197e..831d10480c51c 100644
--- a/llvm/test/CodeGen/AMDGPU/function-args-inreg.ll
+++ b/llvm/test/CodeGen/AMDGPU/function-args-inreg.ll
@@ -1591,7 +1591,7 @@ define void @too_many_args_use_workitem_id_x_inreg(
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    global_store_b32 v[0:1], v18, off dlc
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX11-NEXT:    v_dual_mov_b32 v15, s21 :: v_dual_mov_b32 v14, s20
+; GFX11-NEXT:    v_dual_mov_b32 v14, s20 :: v_dual_mov_b32 v15, s21
 ; GFX11-NEXT:    v_dual_mov_b32 v16, s22 :: v_dual_mov_b32 v17, s23
 ; GFX11-NEXT:    v_mov_b32_e32 v18, s24
 ; GFX11-NEXT:    global_store_b32 v[0:1], v14, off dlc
@@ -1604,8 +1604,8 @@ define void @too_many_args_use_workitem_id_x_inreg(
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    global_store_b32 v[0:1], v18, off dlc
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX11-NEXT:    v_dual_mov_b32 v17, s28 :: v_dual_mov_b32 v14, s25
-; GFX11-NEXT:    v_dual_mov_b32 v15, s26 :: v_dual_mov_b32 v16, s27
+; GFX11-NEXT:    v_dual_mov_b32 v14, s25 :: v_dual_mov_b32 v15, s26
+; GFX11-NEXT:    v_dual_mov_b32 v16, s27 :: v_dual_mov_b32 v17, s28
 ; GFX11-NEXT:    v_mov_b32_e32 v18, s29
 ; GFX11-NEXT:    global_store_b32 v[0:1], v14, off dlc
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
diff --git a/llvm/test/CodeGen/AMDGPU/function-args.ll b/llvm/test/CodeGen/AMDGPU/function-args.ll
index 81b8b36180746..a901d7f97eb37 100644
--- a/llvm/test/CodeGen/AMDGPU/function-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/function-args.ll
@@ -3380,42 +3380,117 @@ define void @void_func_v32i32_v2i16_v2f16_v2bf16_v4bf16(<32 x i32> %arg0, <2 x i
 }
 
 define void @void_func_v32i32_v2i64_v2f64(<32 x i32> %arg0, <2 x i64> %arg1, <2 x double> %arg2) #0 {
-; CIGFX89-LABEL: void_func_v32i32_v2i64_v2f64:
-; CIGFX89:       ; %bb.0:
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CIGFX89-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; CIGFX89-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:20
-; CIGFX89-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:16
-; CIGFX89-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:12
-; CIGFX89-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
-; CIGFX89-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
-; CIGFX89-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:32
-; CIGFX89-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:28
-; CIGFX89-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:24
-; CIGFX89-NEXT:    s_mov_b32 s7, 0xf000
-; CIGFX89-NEXT:    s_mov_b32 s6, -1
-; CIGFX89-NEXT:    s_waitcnt vmcnt(8)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[24:27], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
-; CIGFX89-NEXT:    s_waitcnt vmcnt(0)
-; CIGFX89-NEXT:    s_setpc_b64 s[30:31]
+; CI-LABEL: void_func_v32i32_v2i64_v2f64:
+; CI:       ; %bb.0:
+; CI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
+; CI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:32
+; CI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:28
+; CI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:24
+; CI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:16
+; CI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:12
+; CI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:8
+; CI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:4
+; CI-NEXT:    s_mov_b32 s7, 0xf000
+; CI-NEXT:    s_mov_b32 s6, -1
+; CI-NEXT:    s_waitcnt vmcnt(7)
+; CI-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[24:27], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:20
+; CI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[35:38], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[31:34], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    s_setpc_b64 s[30:31]
+;
+; VI-LABEL: void_func_v32i32_v2i64_v2f64:
+; VI:       ; %bb.0:
+; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
+; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:4
+; VI-NEXT:    s_mov_b32 s7, 0xf000
+; VI-NEXT:    s_mov_b32 s6, -1
+; VI-NEXT:    s_waitcnt vmcnt(7)
+; VI-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[24:27], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[35:38], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[31:34], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-LABEL: void_func_v32i32_v2i64_v2f64:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
+; GFX9-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    s_mov_b32 s7, 0xf000
+; GFX9-NEXT:    s_mov_b32 s6, -1
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[24:27], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[35:38], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[31:34], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: void_func_v32i32_v2i64_v2f64:
 ; GFX11:       ; %bb.0:
@@ -3552,13 +3627,13 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; CI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    s_mov_b32 s6, -1
-; CI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:64
-; CI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:60
-; CI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:56
-; CI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:52
-; CI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:16
-; CI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:12
-; CI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:8
+; CI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:16
+; CI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:12
+; CI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
+; CI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; CI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:32
+; CI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:28
+; CI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:24
 ; CI-NEXT:    s_waitcnt vmcnt(7)
 ; CI-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
@@ -3570,29 +3645,29 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:4
-; CI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:32
-; CI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:28
-; CI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:24
-; CI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:20
-; CI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:48
-; CI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:44
-; CI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:40
-; CI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:36
+; CI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:20
+; CI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:48
+; CI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:44
+; CI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:40
+; CI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:64
+; CI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:60
+; CI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:56
+; CI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
+; CI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:36
 ; CI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
-; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: void_func_v32i32_v8i32_v8f32:
@@ -3601,13 +3676,13 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    s_mov_b32 s7, 0xf000
 ; VI-NEXT:    s_mov_b32 s6, -1
-; VI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:56
-; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:52
-; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:24
 ; VI-NEXT:    s_waitcnt vmcnt(7)
 ; VI-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -3619,29 +3694,29 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:4
-; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:24
-; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:48
-; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:44
-; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:40
-; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:36
+; VI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:44
+; VI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:40
+; VI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:64
+; VI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:56
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
+; VI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:36
 ; VI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
-; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: void_func_v32i32_v8i32_v8f32:
@@ -3650,13 +3725,13 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    s_mov_b32 s7, 0xf000
 ; GFX9-NEXT:    s_mov_b32 s6, -1
-; GFX9-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:56
-; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:52
-; GFX9-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_dword v39, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:24
 ; GFX9-NEXT:    s_waitcnt vmcnt(7)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[28:31], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -3668,15 +3743,15 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:48
-; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:40
-; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:36
+; GFX9-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:44
+; GFX9-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:40
+; GFX9-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:64
+; GFX9-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:56
+; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:52
+; GFX9-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:36
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -3684,14 +3759,14 @@ define void @void_func_v32i32_v8i32_v8f32(<32 x i32> %arg0, <8 x i32> %arg1, <8
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: void_func_v32i32_v8i32_v8f32:
@@ -3791,40 +3866,40 @@ define void @void_func_v32i32_v16i32_v16f32(<32 x i32> %arg0, <16 x i32> %arg1,
 ; CI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:4
-; CI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:96
-; CI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
-; CI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:88
-; CI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:84
-; CI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:112
-; CI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:108
-; CI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:104
+; CI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:112
+; CI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:108
+; CI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:104
+; CI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
+; CI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:128
+; CI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:124
+; CI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:120
 ; CI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:100
-; CI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:128
-; CI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:124
-; CI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:120
-; CI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:116
-; CI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:80
-; CI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:76
-; CI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:72
-; CI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:68
+; CI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
+; CI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:80
+; CI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:76
+; CI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:72
+; CI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:96
+; CI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:92
+; CI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:88
+; CI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:84
+; CI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:68
 ; CI-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
-; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: void_func_v32i32_v16i32_v16f32:
@@ -3864,40 +3939,40 @@ define void @void_func_v32i32_v16i32_v16f32(<32 x i32> %arg0, <16 x i32> %arg1,
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:4
-; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:96
-; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
-; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:88
-; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:84
-; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:112
-; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:108
-; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:104
+; VI-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:112
+; VI-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:108
+; VI-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:104
+; VI-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
+; VI-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:128
+; VI-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:124
+; VI-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:120
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:128
-; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:124
-; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:120
-; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:116
-; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:80
-; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:76
-; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:72
-; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:68
+; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
+; VI-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:80
+; VI-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:76
+; VI-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:72
+; VI-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:96
+; VI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:92
+; VI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:88
+; VI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:84
+; VI-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:68
 ; VI-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
-; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: void_func_v32i32_v16i32_v16f32:
@@ -3938,27 +4013,27 @@ define void @void_func_v32i32_v16i32_v16f32(<32 x i32> %arg0, <16 x i32> %arg1,
 ; GFX9-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:4
-; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:96
-; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:92
-; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:88
-; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:84
-; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:112
-; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:108
-; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:104
+; GFX9-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:112
+; GFX9-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:108
+; GFX9-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:104
+; GFX9-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:100
+; GFX9-NEXT:    buffer_load_dword v11, off, s[0:3], s32 offset:128
+; GFX9-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:124
+; GFX9-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:120
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[32:35], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:100
-; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:128
-; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:124
-; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:120
-; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:116
-; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:80
-; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:76
-; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:72
-; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:68
+; GFX9-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:116
+; GFX9-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:80
+; GFX9-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:76
+; GFX9-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:72
+; GFX9-NEXT:    buffer_load_dword v23, off, s[0:3], s32 offset:96
+; GFX9-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:92
+; GFX9-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:88
+; GFX9-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:84
+; GFX9-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:68
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dwordx4 v[36:39], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -3966,14 +4041,14 @@ define void @void_func_v32i32_v16i32_v16f32(<32 x i32> %arg0, <16 x i32> %arg1,
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[4:7], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[20:23], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: void_func_v32i32_v16i32_v16f32:
@@ -4259,9 +4334,9 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; CI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    s_mov_b32 s6, -1
-; CI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:60
-; CI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:64
-; CI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:48
+; CI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:48
+; CI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:60
+; CI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:64
 ; CI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:52
 ; CI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:56
 ; CI-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:36
@@ -4275,16 +4350,16 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:28
-; CI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:32
-; CI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:20
-; CI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:24
 ; CI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:16
-; CI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:12
-; CI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:8
-; CI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:4
+; CI-NEXT:    buffer_load_dword v12, off, s[0:3], s32 offset:32
+; CI-NEXT:    buffer_load_dword v13, off, s[0:3], s32 offset:28
+; CI-NEXT:    buffer_load_dword v14, off, s[0:3], s32 offset:24
+; CI-NEXT:    buffer_load_dword v15, off, s[0:3], s32 offset:20
+; CI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:16
+; CI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:12
+; CI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:8
+; CI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:4
 ; CI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:44
 ; CI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
@@ -4292,15 +4367,15 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v33, off, s[4:7], 0
+; CI-NEXT:    buffer_store_byte v34, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v32, off, s[4:7], 0
+; CI-NEXT:    buffer_store_byte v33, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v36, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v35, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v34, off, s[4:7], 0
+; CI-NEXT:    buffer_store_byte v32, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v20, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
@@ -4308,14 +4383,6 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v37, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v17, off, s[4:7], 0
-; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v16, off, s[4:7], 0
-; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v19, off, s[4:7], 0
-; CI-NEXT:    s_waitcnt vmcnt(0)
-; CI-NEXT:    buffer_store_byte v18, off, s[4:7], 0
-; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v12, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v13, off, s[4:7], 0
@@ -4324,6 +4391,14 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_store_byte v15, off, s[4:7], 0
 ; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_byte v16, off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_byte v17, off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_byte v18, off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
+; CI-NEXT:    buffer_store_byte v19, off, s[4:7], 0
+; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: void_func_v32i32_v16i8:
@@ -4332,9 +4407,9 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; VI-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; VI-NEXT:    s_mov_b32 s7, 0xf000
 ; VI-NEXT:    s_mov_b32 s6, -1
-; VI-NEXT:    buffer_load_ubyte v32, off, s[0:3], s32 offset:60
-; VI-NEXT:    buffer_load_ubyte v33, off, s[0:3], s32 offset:64
-; VI-NEXT:    buffer_load_ubyte v34, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ubyte v32, off, s[0:3], s32 offset:48
+; VI-NEXT:    buffer_load_ubyte v33, off, s[0:3], s32 offset:60
+; VI-NEXT:    buffer_load_ubyte v34, off, s[0:3], s32 offset:64
 ; VI-NEXT:    buffer_load_ubyte v35, off, s[0:3], s32 offset:52
 ; VI-NEXT:    buffer_load_ubyte v36, off, s[0:3], s32 offset:56
 ; VI-NEXT:    buffer_load_ubyte v37, off, s[0:3], s32 offset:36
@@ -4348,16 +4423,16 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_load_ubyte v16, off, s[0:3], s32 offset:28
-; VI-NEXT:    buffer_load_ubyte v17, off, s[0:3], s32 offset:32
-; VI-NEXT:    buffer_load_ubyte v18, off, s[0:3], s32 offset:20
-; VI-NEXT:    buffer_load_ubyte v19, off, s[0:3], s32 offset:24
 ; VI-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_load_ubyte v12, off, s[0:3], s32 offset:16
-; VI-NEXT:    buffer_load_ubyte v13, off, s[0:3], s32 offset:12
-; VI-NEXT:    buffer_load_ubyte v14, off, s[0:3], s32 offset:8
-; VI-NEXT:    buffer_load_ubyte v15, off, s[0:3], s32 offset:4
+; VI-NEXT:    buffer_load_ubyte v12, off, s[0:3], s32 offset:32
+; VI-NEXT:    buffer_load_ubyte v13, off, s[0:3], s32 offset:28
+; VI-NEXT:    buffer_load_ubyte v14, off, s[0:3], s32 offset:24
+; VI-NEXT:    buffer_load_ubyte v15, off, s[0:3], s32 offset:20
+; VI-NEXT:    buffer_load_ubyte v16, off, s[0:3], s32 offset:16
+; VI-NEXT:    buffer_load_ubyte v17, off, s[0:3], s32 offset:12
+; VI-NEXT:    buffer_load_ubyte v18, off, s[0:3], s32 offset:8
+; VI-NEXT:    buffer_load_ubyte v19, off, s[0:3], s32 offset:4
 ; VI-NEXT:    buffer_load_ubyte v20, off, s[0:3], s32 offset:44
 ; VI-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -4365,15 +4440,15 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v33, off, s[4:7], 0
+; VI-NEXT:    buffer_store_byte v34, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v32, off, s[4:7], 0
+; VI-NEXT:    buffer_store_byte v33, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v36, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v35, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v34, off, s[4:7], 0
+; VI-NEXT:    buffer_store_byte v32, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v20, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -4381,14 +4456,6 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v37, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v17, off, s[4:7], 0
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v16, off, s[4:7], 0
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v19, off, s[4:7], 0
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    buffer_store_byte v18, off, s[4:7], 0
-; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v12, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v13, off, s[4:7], 0
@@ -4397,6 +4464,14 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_store_byte v15, off, s[4:7], 0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_byte v16, off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_byte v17, off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_byte v18, off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    buffer_store_byte v19, off, s[4:7], 0
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: void_func_v32i32_v16i8:
@@ -4405,9 +4480,9 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; GFX9-NEXT:    buffer_load_dword v31, off, s[0:3], s32
 ; GFX9-NEXT:    s_mov_b32 s7, 0xf000
 ; GFX9-NEXT:    s_mov_b32 s6, -1
-; GFX9-NEXT:    buffer_load_ubyte v32, off, s[0:3], s32 offset:60
-; GFX9-NEXT:    buffer_load_ubyte v33, off, s[0:3], s32 offset:64
-; GFX9-NEXT:    buffer_load_ubyte v34, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ubyte v32, off, s[0:3], s32 offset:48
+; GFX9-NEXT:    buffer_load_ubyte v33, off, s[0:3], s32 offset:60
+; GFX9-NEXT:    buffer_load_ubyte v34, off, s[0:3], s32 offset:64
 ; GFX9-NEXT:    buffer_load_ubyte v35, off, s[0:3], s32 offset:52
 ; GFX9-NEXT:    buffer_load_ubyte v36, off, s[0:3], s32 offset:56
 ; GFX9-NEXT:    buffer_load_ubyte v37, off, s[0:3], s32 offset:36
@@ -4421,18 +4496,17 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[16:19], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_load_ubyte v16, off, s[0:3], s32 offset:28
-; GFX9-NEXT:    buffer_load_ubyte v17, off, s[0:3], s32 offset:32
-; GFX9-NEXT:    buffer_load_ubyte v18, off, s[0:3], s32 offset:20
-; GFX9-NEXT:    buffer_load_ubyte v19, off, s[0:3], s32 offset:24
-; GFX9-NEXT:    buffer_load_ubyte v20, off, s[0:3], s32 offset:44
-; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dwordx4 v[12:15], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_load_ubyte v12, off, s[0:3], s32 offset:16
-; GFX9-NEXT:    buffer_load_ubyte v13, off, s[0:3], s32 offset:12
-; GFX9-NEXT:    buffer_load_ubyte v14, off, s[0:3], s32 offset:8
-; GFX9-NEXT:    buffer_load_ubyte v15, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ubyte v12, off, s[0:3], s32 offset:32
+; GFX9-NEXT:    buffer_load_ubyte v13, off, s[0:3], s32 offset:28
+; GFX9-NEXT:    buffer_load_ubyte v14, off, s[0:3], s32 offset:24
+; GFX9-NEXT:    buffer_load_ubyte v15, off, s[0:3], s32 offset:20
+; GFX9-NEXT:    buffer_load_ubyte v16, off, s[0:3], s32 offset:16
+; GFX9-NEXT:    buffer_load_ubyte v17, off, s[0:3], s32 offset:12
+; GFX9-NEXT:    buffer_load_ubyte v18, off, s[0:3], s32 offset:8
+; GFX9-NEXT:    buffer_load_ubyte v19, off, s[0:3], s32 offset:4
+; GFX9-NEXT:    buffer_load_ubyte v20, off, s[0:3], s32 offset:44
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    buffer_store_dwordx4 v[8:11], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -4440,15 +4514,15 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v33, off, s[4:7], 0
+; GFX9-NEXT:    buffer_store_byte v34, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v32, off, s[4:7], 0
+; GFX9-NEXT:    buffer_store_byte v33, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v36, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v35, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v34, off, s[4:7], 0
+; GFX9-NEXT:    buffer_store_byte v32, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v20, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
@@ -4456,14 +4530,6 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v37, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v17, off, s[4:7], 0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v16, off, s[4:7], 0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v19, off, s[4:7], 0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    buffer_store_byte v18, off, s[4:7], 0
-; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v12, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v13, off, s[4:7], 0
@@ -4472,6 +4538,14 @@ define void @void_func_v32i32_v16i8(<32 x i32> %arg0, <16 x i8> %arg1) #0 {
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_store_byte v15, off, s[4:7], 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_byte v16, off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_byte v17, off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_byte v18, off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    buffer_store_byte v19, off, s[4:7], 0
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-TRUE16-LABEL: void_func_v32i32_v16i8:
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
index ca9cb456fa19f..e40917d4307fb 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -3670,10 +3670,10 @@ define amdgpu_gfx void @test_call_external_void_func_v5i8() #0 {
 ; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s30, 0
 ; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s31, 1
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, v5
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[3:4], 24, v[5:6]
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v1, 8, v5
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v2, 16, v5
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, v5
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v4, v6
 ; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-FAKE16-NEXT:    v_readlane_b32 s31, v40, 1
@@ -4186,9 +4186,10 @@ define amdgpu_gfx void @test_call_external_void_func_v32i8() #0 {
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v12, v3
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v20, v17
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v24, v18
-; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v28, v19 :: v_dual_mov_b32 v19, v34
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v28, v19
 ; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v2, v36 :: v_dual_mov_b32 v3, v37
 ; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v17, v32 :: v_dual_mov_b32 v18, v33
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v19, v34
 ; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-FAKE16-NEXT:    v_readlane_b32 s31, v40, 1
 ; GFX11-FAKE16-NEXT:    v_readlane_b32 s30, v40, 0
@@ -5346,14 +5347,14 @@ define amdgpu_gfx void @test_call_external_void_func_v5i8_ret() #0 {
 ; GFX11-FAKE16-NEXT:    v_writelane_b32 v42, s30, 0
 ; GFX11-FAKE16-NEXT:    v_writelane_b32 v42, s31, 1
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, v5
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b64 v[3:4], 24, v[5:6]
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v1, 8, v5
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v2, 16, v5
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, v5
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v4, v6
 ; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b16 v1, 8, v1
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, 0xff, v0
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b16 v3, 8, v3
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v2, 0xff, v2
@@ -5639,16 +5640,16 @@ define amdgpu_gfx void @test_call_external_void_func_v8i8_ret() #0 {
 ; GFX11-FAKE16-NEXT:    v_writelane_b32 v42, s30, 0
 ; GFX11-FAKE16-NEXT:    v_writelane_b32 v42, s31, 1
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v8, 8, v0
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v3, 24, v0
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v7, 24, v1
-; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v1, v8
+; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v1, v8
 ; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b16 v5, 8, v5
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v4, 0xff, v4
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b16 v7, 8, v7
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v6, 0xff, v6
@@ -6197,9 +6198,10 @@ define amdgpu_gfx void @test_call_external_void_func_v32i8_ret() #0 {
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v12, v3
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v20, v17
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v24, v18
-; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v28, v19 :: v_dual_mov_b32 v19, v34
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v28, v19
 ; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v2, v36 :: v_dual_mov_b32 v3, v37
 ; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v17, v32 :: v_dual_mov_b32 v18, v33
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v19, v34
 ; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-FAKE16-NEXT:    v_lshlrev_b16 v9, 8, v9
 ; GFX11-FAKE16-NEXT:    v_and_b32_e32 v8, 0xff, v8
@@ -9903,8 +9905,8 @@ define amdgpu_gfx void @test_call_external_void_func_v16i8() #0 {
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v15, 24, v3
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v8, v2
-; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v12, v3 :: v_dual_mov_b32 v3, v18
-; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v1, v16 :: v_dual_mov_b32 v2, v17
+; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v12, v3 :: v_dual_mov_b32 v1, v16
+; GFX11-FAKE16-NEXT:    v_dual_mov_b32 v2, v17 :: v_dual_mov_b32 v3, v18
 ; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-FAKE16-NEXT:    v_readlane_b32 s31, v40, 1
 ; GFX11-FAKE16-NEXT:    v_readlane_b32 s30, v40, 0
@@ -17250,22 +17252,21 @@ define amdgpu_gfx void @stack_8xv5f32() #0 {
 ; GFX11-NEXT:    s_add_i32 s0, s32, 16
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s32
 ; GFX11-NEXT:    scratch_store_b128 off, v[4:7], s0
-; GFX11-NEXT:    v_mov_b32_e32 v6, 1.0
 ; GFX11-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0
 ; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v3, 0
 ; GFX11-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v5, 1.0
-; GFX11-NEXT:    v_dual_mov_b32 v7, 1.0 :: v_dual_mov_b32 v8, 1.0
-; GFX11-NEXT:    v_dual_mov_b32 v9, 1.0 :: v_dual_mov_b32 v10, 2.0
-; GFX11-NEXT:    v_dual_mov_b32 v11, 2.0 :: v_dual_mov_b32 v12, 2.0
-; GFX11-NEXT:    v_dual_mov_b32 v13, 2.0 :: v_dual_mov_b32 v14, 2.0
-; GFX11-NEXT:    v_dual_mov_b32 v15, 0x40400000 :: v_dual_mov_b32 v16, 0x40400000
-; GFX11-NEXT:    v_dual_mov_b32 v17, 0x40400000 :: v_dual_mov_b32 v18, 0x40400000
-; GFX11-NEXT:    v_dual_mov_b32 v19, 0x40400000 :: v_dual_mov_b32 v20, 4.0
-; GFX11-NEXT:    v_dual_mov_b32 v21, 4.0 :: v_dual_mov_b32 v22, 4.0
-; GFX11-NEXT:    v_dual_mov_b32 v23, 4.0 :: v_dual_mov_b32 v24, 4.0
-; GFX11-NEXT:    v_dual_mov_b32 v25, 0x40a00000 :: v_dual_mov_b32 v26, 0x40a00000
-; GFX11-NEXT:    v_dual_mov_b32 v27, 0x40a00000 :: v_dual_mov_b32 v28, 0x40a00000
-; GFX11-NEXT:    v_mov_b32_e32 v29, 0x40a00000
+; GFX11-NEXT:    v_dual_mov_b32 v6, 1.0 :: v_dual_mov_b32 v7, 1.0
+; GFX11-NEXT:    v_dual_mov_b32 v8, 1.0 :: v_dual_mov_b32 v9, 1.0
+; GFX11-NEXT:    v_dual_mov_b32 v10, 2.0 :: v_dual_mov_b32 v11, 2.0
+; GFX11-NEXT:    v_dual_mov_b32 v12, 2.0 :: v_dual_mov_b32 v13, 2.0
+; GFX11-NEXT:    v_dual_mov_b32 v14, 2.0 :: v_dual_mov_b32 v15, 0x40400000
+; GFX11-NEXT:    v_dual_mov_b32 v16, 0x40400000 :: v_dual_mov_b32 v17, 0x40400000
+; GFX11-NEXT:    v_dual_mov_b32 v18, 0x40400000 :: v_dual_mov_b32 v19, 0x40400000
+; GFX11-NEXT:    v_dual_mov_b32 v20, 4.0 :: v_dual_mov_b32 v21, 4.0
+; GFX11-NEXT:    v_dual_mov_b32 v22, 4.0 :: v_dual_mov_b32 v23, 4.0
+; GFX11-NEXT:    v_dual_mov_b32 v24, 4.0 :: v_dual_mov_b32 v25, 0x40a00000
+; GFX11-NEXT:    v_dual_mov_b32 v26, 0x40a00000 :: v_dual_mov_b32 v27, 0x40a00000
+; GFX11-NEXT:    v_dual_mov_b32 v28, 0x40a00000 :: v_dual_mov_b32 v29, 0x40a00000
 ; GFX11-NEXT:    v_mov_b32_e32 v30, 0x40c00000
 ; GFX11-NEXT:    v_mov_b32_e32 v31, 0x40e00000
 ; GFX11-NEXT:    s_mov_b32 s1, external_void_func_8xv5f32@abs32@hi
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll
index 6384fdba7a45a..668219875db72 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll
@@ -2450,22 +2450,21 @@ define amdgpu_gfx <72 x i32> @return_72xi32(<72 x i32> %val) #1 {
 ; GFX10-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:28
 ; GFX10-NEXT:    buffer_store_dword v7, v0, s[0:3], 0 offen offset:24
 ; GFX10-NEXT:    buffer_store_dword v6, v0, s[0:3], 0 offen offset:20
-; GFX10-NEXT:    s_clause 0x3
-; GFX10-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:12
-; GFX10-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:16
-; GFX10-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:20
-; GFX10-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:24
 ; GFX10-NEXT:    buffer_store_dword v5, v0, s[0:3], 0 offen offset:16
 ; GFX10-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen offset:12
-; GFX10-NEXT:    s_clause 0x3
-; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:4
-; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:8
-; GFX10-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:160
-; GFX10-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:28
 ; GFX10-NEXT:    buffer_store_dword v3, v0, s[0:3], 0 offen offset:8
-; GFX10-NEXT:    buffer_load_dword v3, off, s[0:3], s32
+; GFX10-NEXT:    s_clause 0x8
+; GFX10-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:28
+; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:24
+; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:20
+; GFX10-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:16
+; GFX10-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:12
+; GFX10-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:8
+; GFX10-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:4
+; GFX10-NEXT:    buffer_load_dword v10, off, s[0:3], s32
+; GFX10-NEXT:    buffer_load_dword v27, off, s[0:3], s32 offset:160
 ; GFX10-NEXT:    buffer_store_dword v2, v0, s[0:3], 0 offen offset:4
-; GFX10-NEXT:    s_waitcnt vmcnt(2)
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-NEXT:    buffer_store_dword v27, v0, s[0:3], 0 offen offset:284
 ; GFX10-NEXT:    buffer_store_dword v51, v0, s[0:3], 0 offen offset:280
 ; GFX10-NEXT:    buffer_store_dword v50, v0, s[0:3], 0 offen offset:276
@@ -2499,16 +2498,14 @@ define amdgpu_gfx <72 x i32> @return_72xi32(<72 x i32> %val) #1 {
 ; GFX10-NEXT:    buffer_store_dword v13, v0, s[0:3], 0 offen offset:164
 ; GFX10-NEXT:    buffer_store_dword v12, v0, s[0:3], 0 offen offset:160
 ; GFX10-NEXT:    buffer_store_dword v11, v0, s[0:3], 0 offen offset:156
-; GFX10-NEXT:    s_waitcnt vmcnt(1)
-; GFX10-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:152
-; GFX10-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:148
-; GFX10-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:144
-; GFX10-NEXT:    buffer_store_dword v7, v0, s[0:3], 0 offen offset:140
-; GFX10-NEXT:    buffer_store_dword v6, v0, s[0:3], 0 offen offset:136
-; GFX10-NEXT:    buffer_store_dword v5, v0, s[0:3], 0 offen offset:132
-; GFX10-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen offset:128
-; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    buffer_store_dword v3, v0, s[0:3], 0 offen offset:124
+; GFX10-NEXT:    buffer_store_dword v3, v0, s[0:3], 0 offen offset:152
+; GFX10-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen offset:148
+; GFX10-NEXT:    buffer_store_dword v5, v0, s[0:3], 0 offen offset:144
+; GFX10-NEXT:    buffer_store_dword v6, v0, s[0:3], 0 offen offset:140
+; GFX10-NEXT:    buffer_store_dword v7, v0, s[0:3], 0 offen offset:136
+; GFX10-NEXT:    buffer_store_dword v8, v0, s[0:3], 0 offen offset:132
+; GFX10-NEXT:    buffer_store_dword v9, v0, s[0:3], 0 offen offset:128
+; GFX10-NEXT:    buffer_store_dword v10, v0, s[0:3], 0 offen offset:124
 ; GFX10-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -3290,7 +3287,7 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    scratch_store_b128 off, v[28:31], s33 offset:1536 ; 16-byte Folded Spill
 ; GFX11-NEXT:    scratch_store_b128 off, v[32:35], s32
-; GFX11-NEXT:    v_dual_mov_b32 v31, v47 :: v_dual_mov_b32 v32, v36
+; GFX11-NEXT:    v_mov_b32_e32 v32, v36
 ; GFX11-NEXT:    v_dual_mov_b32 v33, v48 :: v_dual_mov_b32 v34, v49
 ; GFX11-NEXT:    v_dual_mov_b32 v35, v50 :: v_dual_mov_b32 v48, v51
 ; GFX11-NEXT:    v_dual_mov_b32 v49, v52 :: v_dual_mov_b32 v50, v53
@@ -3317,7 +3314,6 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    s_add_i32 s2, s32, 0x70
 ; GFX11-NEXT:    v_mov_b32_e32 v6, v17
 ; GFX11-NEXT:    scratch_store_b128 off, v[12:15], s2
-; GFX11-NEXT:    v_mov_b32_e32 v13, v24
 ; GFX11-NEXT:    s_add_i32 s2, s32, 0x6c
 ; GFX11-NEXT:    v_mov_b32_e32 v7, v18
 ; GFX11-NEXT:    scratch_store_b32 off, v0, s2
@@ -3328,26 +3324,27 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    v_dual_mov_b32 v12, v23 :: v_dual_mov_b32 v29, v45
 ; GFX11-NEXT:    scratch_store_b128 off, v[40:43], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 64
-; GFX11-NEXT:    v_mov_b32_e32 v14, v25
+; GFX11-NEXT:    v_mov_b32_e32 v13, v24
 ; GFX11-NEXT:    scratch_store_b128 off, v[52:55], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 48
-; GFX11-NEXT:    v_mov_b32_e32 v16, v27
+; GFX11-NEXT:    v_mov_b32_e32 v14, v25
 ; GFX11-NEXT:    scratch_store_b128 off, v[36:39], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 32
-; GFX11-NEXT:    v_mov_b32_e32 v30, v46
+; GFX11-NEXT:    v_mov_b32_e32 v16, v27
 ; GFX11-NEXT:    scratch_store_b128 off, v[48:51], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 16
+; GFX11-NEXT:    v_mov_b32_e32 v30, v46
 ; GFX11-NEXT:    scratch_store_b128 off, v[32:35], s2
-; GFX11-NEXT:    scratch_load_b128 v[1:4], off, s33 offset:1584 ; 16-byte Folded Reload
-; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mov_b32_e32 v1, 42
-; GFX11-NEXT:    s_clause 0x2
+; GFX11-NEXT:    s_clause 0x3
+; GFX11-NEXT:    scratch_load_b128 v[1:4], off, s33 offset:1584
 ; GFX11-NEXT:    scratch_load_b128 v[17:20], off, s33 offset:1568
 ; GFX11-NEXT:    scratch_load_b128 v[21:24], off, s33 offset:1552
 ; GFX11-NEXT:    scratch_load_b128 v[25:28], off, s33 offset:1536
 ; GFX11-NEXT:    s_add_i32 s2, s33, 0x400
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, s2
+; GFX11-NEXT:    v_dual_mov_b32 v31, v47 :: v_dual_mov_b32 v0, s2
+; GFX11-NEXT:    s_waitcnt vmcnt(3)
+; GFX11-NEXT:    v_mov_b32_e32 v1, 42
 ; GFX11-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-NEXT:    s_clause 0xb
 ; GFX11-NEXT:    scratch_load_b32 v59, off, s33
diff --git a/llvm/test/CodeGen/AMDGPU/global_atomics.ll b/llvm/test/CodeGen/AMDGPU/global_atomics.ll
index 3e15b135eeab9..b7b69ed9f53ba 100644
--- a/llvm/test/CodeGen/AMDGPU/global_atomics.ll
+++ b/llvm/test/CodeGen/AMDGPU/global_atomics.ll
@@ -4276,14 +4276,14 @@ define amdgpu_kernel void @atomic_cmpxchg_i32_addr64_offset(ptr addrspace(1) %ou
 ; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x34
 ; GFX9-NEXT:    s_load_dwordx2 s[2:3], s[4:5], 0x24
 ; GFX9-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX9-NEXT:    s_load_dword s7, s[4:5], 0x3c
 ; GFX9-NEXT:    v_mov_b32_e32 v2, 0
-; GFX9-NEXT:    s_load_dword s4, s[4:5], 0x3c
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    s_lshl_b64 s[0:1], s[0:1], 2
 ; GFX9-NEXT:    s_add_u32 s0, s2, s0
 ; GFX9-NEXT:    s_addc_u32 s1, s3, s1
 ; GFX9-NEXT:    v_mov_b32_e32 v0, s6
-; GFX9-NEXT:    v_mov_b32_e32 v1, s4
+; GFX9-NEXT:    v_mov_b32_e32 v1, s7
 ; GFX9-NEXT:    global_atomic_cmpswap v2, v[0:1], s[0:1] offset:16
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_wbinvl1_vol
@@ -4523,14 +4523,14 @@ define amdgpu_kernel void @atomic_cmpxchg_i32_addr64(ptr addrspace(1) %out, i32
 ; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x34
 ; GFX9-NEXT:    s_load_dwordx2 s[2:3], s[4:5], 0x24
 ; GFX9-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX9-NEXT:    s_load_dword s7, s[4:5], 0x3c
 ; GFX9-NEXT:    v_mov_b32_e32 v2, 0
-; GFX9-NEXT:    s_load_dword s4, s[4:5], 0x3c
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    s_lshl_b64 s[0:1], s[0:1], 2
 ; GFX9-NEXT:    s_add_u32 s0, s2, s0
 ; GFX9-NEXT:    s_addc_u32 s1, s3, s1
 ; GFX9-NEXT:    v_mov_b32_e32 v0, s6
-; GFX9-NEXT:    v_mov_b32_e32 v1, s4
+; GFX9-NEXT:    v_mov_b32_e32 v1, s7
 ; GFX9-NEXT:    global_atomic_cmpswap v2, v[0:1], s[0:1]
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_wbinvl1_vol
diff --git a/llvm/test/CodeGen/AMDGPU/half.ll b/llvm/test/CodeGen/AMDGPU/half.ll
index f767511370eee..117cf40de72d2 100644
--- a/llvm/test/CodeGen/AMDGPU/half.ll
+++ b/llvm/test/CodeGen/AMDGPU/half.ll
@@ -2952,8 +2952,8 @@ define amdgpu_kernel void @global_truncstore_v16f32_to_v16f16(ptr addrspace(1) %
 ; CI-NEXT:    s_add_u32 s2, s2, 16
 ; CI-NEXT:    flat_load_dwordx4 v[0:3], v[0:1]
 ; CI-NEXT:    v_mov_b32_e32 v5, s5
-; CI-NEXT:    flat_load_dwordx4 v[4:7], v[4:5]
 ; CI-NEXT:    s_addc_u32 s3, s3, 0
+; CI-NEXT:    flat_load_dwordx4 v[4:7], v[4:5]
 ; CI-NEXT:    v_mov_b32_e32 v13, s3
 ; CI-NEXT:    v_mov_b32_e32 v12, s2
 ; CI-NEXT:    flat_load_dwordx4 v[8:11], v[8:9]
diff --git a/llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll b/llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll
index e82801eadc936..0dfeb3454dad5 100644
--- a/llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll
@@ -245,11 +245,11 @@ define <2 x bfloat> @v_uitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
-; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
+; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, v0.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v1, v3, v5, vcc_lo
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
@@ -273,8 +273,8 @@ define <2 x bfloat> @v_uitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
+; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
@@ -305,12 +305,12 @@ define <2 x bfloat> @v_uitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX12-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX12-TRUE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
+; GFX12-TRUE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-TRUE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
-; GFX12-TRUE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX12-TRUE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
-; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
 ; GFX12-TRUE16-NEXT:    v_mov_b16_e32 v0.l, v0.h
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-TRUE16-NEXT:    v_cndmask_b32_e32 v1, v3, v5, vcc_lo
@@ -341,9 +341,9 @@ define <2 x bfloat> @v_uitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX12-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX12-FAKE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
+; GFX12-FAKE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
-; GFX12-FAKE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX12-FAKE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
@@ -591,8 +591,8 @@ define <3 x bfloat> @v_uitofp_v3i1_to_v3bf16(<3 x i1> %num) {
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e64 v1, 0, 1.0, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v5, v1, 16, 1
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v3, v7, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
+; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v3, v7, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v5, v5, v1, 0x7fff
@@ -682,9 +682,9 @@ define <3 x bfloat> @v_uitofp_v3i1_to_v3bf16(<3 x i1> %num) {
 ; GFX12-FAKE16-NEXT:    v_cndmask_b32_e64 v1, 0, 1.0, vcc_lo
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX12-FAKE16-NEXT:    v_bfe_u32 v5, v1, 16, 1
+; GFX12-FAKE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v3, v7, vcc_lo
-; GFX12-FAKE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX12-FAKE16-NEXT:    v_add3_u32 v5, v5, v1, 0x7fff
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
@@ -1587,11 +1587,11 @@ define <2 x bfloat> @v_sitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_4)
 ; GFX11-TRUE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
-; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
+; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-TRUE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
-; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, v0.h
 ; GFX11-TRUE16-NEXT:    v_cndmask_b32_e32 v1, v3, v5, vcc_lo
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
@@ -1615,8 +1615,8 @@ define <2 x bfloat> @v_sitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_4)
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
+; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
@@ -1647,12 +1647,12 @@ define <2 x bfloat> @v_sitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX12-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX12-TRUE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
+; GFX12-TRUE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-TRUE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
-; GFX12-TRUE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-TRUE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX12-TRUE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
-; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
 ; GFX12-TRUE16-NEXT:    v_mov_b16_e32 v0.l, v0.h
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-TRUE16-NEXT:    v_cndmask_b32_e32 v1, v3, v5, vcc_lo
@@ -1683,9 +1683,9 @@ define <2 x bfloat> @v_sitofp_v2i1_to_v2bf16(<2 x i1> %num) {
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX12-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX12-FAKE16-NEXT:    v_bfe_u32 v3, v1, 16, 1
+; GFX12-FAKE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v2, v4, vcc_lo
-; GFX12-FAKE16-NEXT:    v_or_b32_e32 v5, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX12-FAKE16-NEXT:    v_add3_u32 v3, v3, v1, 0x7fff
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
@@ -1935,8 +1935,8 @@ define <3 x bfloat> @v_sitofp_v3i1_to_v3bf16(<3 x i1> %num) {
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e64 v1, 0, -1.0, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX11-FAKE16-NEXT:    v_bfe_u32 v5, v1, 16, 1
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v3, v7, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
+; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v3, v7, vcc_lo
 ; GFX11-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_add3_u32 v5, v5, v1, 0x7fff
@@ -2026,9 +2026,9 @@ define <3 x bfloat> @v_sitofp_v3i1_to_v3bf16(<3 x i1> %num) {
 ; GFX12-FAKE16-NEXT:    v_cndmask_b32_e64 v1, 0, -1.0, vcc_lo
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v0, v0
 ; GFX12-FAKE16-NEXT:    v_bfe_u32 v5, v1, 16, 1
+; GFX12-FAKE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
 ; GFX12-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v3, v7, vcc_lo
-; GFX12-FAKE16-NEXT:    v_or_b32_e32 v8, 0x400000, v1
 ; GFX12-FAKE16-NEXT:    v_cmp_u_f32_e32 vcc_lo, v1, v1
 ; GFX12-FAKE16-NEXT:    v_add3_u32 v5, v5, v1, 0x7fff
 ; GFX12-FAKE16-NEXT:    s_wait_alu 0xfffd
diff --git a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
index a71995a798410..ecbf5dfeb3af1 100644
--- a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
+++ b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
@@ -959,11 +959,10 @@ define amdgpu_kernel void @sdiv16_invariant_denom(ptr addrspace(1) nocapture %ar
 ; GFX11-NEXT:    s_and_b32 s5, s5, exec_lo
 ; GFX11-NEXT:    s_cselect_b32 s4, s4, 0
 ; GFX11-NEXT:    s_and_b32 s5, 0xffff, s3
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
-; GFX11-NEXT:    v_add_nc_u32_e32 v2, s4, v2
-; GFX11-NEXT:    s_lshl_b32 s5, s5, 1
 ; GFX11-NEXT:    s_add_i32 s3, s3, 1
-; GFX11-NEXT:    v_mov_b32_e32 v3, s5
+; GFX11-NEXT:    s_lshl_b32 s5, s5, 1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    v_dual_mov_b32 v3, s5 :: v_dual_add_nc_u32 v2, s4, v2
 ; GFX11-NEXT:    s_and_b32 s4, s3, 0xffff
 ; GFX11-NEXT:    s_cmpk_eq_i32 s4, 0x400
 ; GFX11-NEXT:    global_store_b16 v3, v2, s[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/idot4s.ll b/llvm/test/CodeGen/AMDGPU/idot4s.ll
index d28f0a190e117..9e7968f1acb84 100644
--- a/llvm/test/CodeGen/AMDGPU/idot4s.ll
+++ b/llvm/test/CodeGen/AMDGPU/idot4s.ll
@@ -2903,13 +2903,13 @@ define amdgpu_kernel void @idot4_acc32_3src_3ele_src0(ptr addrspace(1) %src1,
 ; GFX9-DL-NEXT:    s_mov_b32 s1, 0xc0c0c01
 ; GFX9-DL-NEXT:    s_mov_b32 s2, 0xc020101
 ; GFX9-DL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-DL-NEXT:    global_load_dword v1, v0, s[12:13]
-; GFX9-DL-NEXT:    global_load_dword v2, v0, s[8:9]
+; GFX9-DL-NEXT:    global_load_dword v1, v0, s[8:9]
+; GFX9-DL-NEXT:    global_load_dword v2, v0, s[12:13]
 ; GFX9-DL-NEXT:    global_load_dword v3, v0, s[10:11]
 ; GFX9-DL-NEXT:    s_load_dword s3, s[14:15], 0x0
 ; GFX9-DL-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX9-DL-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-DL-NEXT:    v_perm_b32 v1, v1, v2, s0
+; GFX9-DL-NEXT:    v_perm_b32 v1, v2, v1, s0
 ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-DL-NEXT:    v_perm_b32 v2, v3, v3, s1
 ; GFX9-DL-NEXT:    v_or_b32_e32 v1, v1, v2
@@ -2925,12 +2925,12 @@ define amdgpu_kernel void @idot4_acc32_3src_3ele_src0(ptr addrspace(1) %src1,
 ; GFX10-DL-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-DL-NEXT:    s_clause 0x2
-; GFX10-DL-NEXT:    global_load_dword v1, v0, s[12:13]
-; GFX10-DL-NEXT:    global_load_dword v2, v0, s[8:9]
+; GFX10-DL-NEXT:    global_load_dword v1, v0, s[8:9]
+; GFX10-DL-NEXT:    global_load_dword v2, v0, s[12:13]
 ; GFX10-DL-NEXT:    global_load_dword v3, v0, s[10:11]
 ; GFX10-DL-NEXT:    s_load_dword s0, s[14:15], 0x0
 ; GFX10-DL-NEXT:    s_waitcnt vmcnt(1)
-; GFX10-DL-NEXT:    v_perm_b32 v0, v1, v2, 0xc06010c
+; GFX10-DL-NEXT:    v_perm_b32 v0, v2, v1, 0xc06010c
 ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-DL-NEXT:    v_perm_b32 v1, v3, v3, 0xc0c0c01
 ; GFX10-DL-NEXT:    v_perm_b32 v2, v3, v3, 0xc020101
@@ -2950,12 +2950,12 @@ define amdgpu_kernel void @idot4_acc32_3src_3ele_src0(ptr addrspace(1) %src1,
 ; GFX11-DL-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX11-DL-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-DL-NEXT:    s_clause 0x2
-; GFX11-DL-NEXT:    global_load_b32 v1, v0, s[4:5]
-; GFX11-DL-NEXT:    global_load_b32 v2, v0, s[0:1]
+; GFX11-DL-NEXT:    global_load_b32 v1, v0, s[0:1]
+; GFX11-DL-NEXT:    global_load_b32 v2, v0, s[4:5]
 ; GFX11-DL-NEXT:    global_load_b32 v0, v0, s[2:3]
 ; GFX11-DL-NEXT:    s_load_b32 s0, s[6:7], 0x0
 ; GFX11-DL-NEXT:    s_waitcnt vmcnt(1)
-; GFX11-DL-NEXT:    v_perm_b32 v1, v1, v2, 0xc06010c
+; GFX11-DL-NEXT:    v_perm_b32 v1, v2, v1, 0xc06010c
 ; GFX11-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-DL-NEXT:    v_perm_b32 v2, v0, v0, 0xc0c0c01
 ; GFX11-DL-NEXT:    v_perm_b32 v0, v0, v0, 0xc020101
diff --git a/llvm/test/CodeGen/AMDGPU/idot4u.ll b/llvm/test/CodeGen/AMDGPU/idot4u.ll
index 82d62910bcb00..f995f426c6372 100644
--- a/llvm/test/CodeGen/AMDGPU/idot4u.ll
+++ b/llvm/test/CodeGen/AMDGPU/idot4u.ll
@@ -1451,8 +1451,8 @@ define amdgpu_kernel void @udot4_multiuse_add1(ptr addrspace(1) %src1,
 ; GFX11-DL-NEXT:    v_bfe_u32 v3, v0, 8, 8
 ; GFX11-DL-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-DL-NEXT:    v_dot4_u32_u8 v0, v1, v0, s0
-; GFX11-DL-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-DL-NEXT:    s_add_i32 s0, s0, s0
+; GFX11-DL-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-DL-NEXT:    v_mul_u32_u24_e32 v2, v2, v3
 ; GFX11-DL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-DL-NEXT:    v_add3_u32 v0, s0, v2, v0
@@ -4454,13 +4454,13 @@ define amdgpu_kernel void @udot4_acc32_3src_3ele_src0(ptr addrspace(1) %src1,
 ; GFX9-DL-NEXT:    s_mov_b32 s1, 0xc0c0c01
 ; GFX9-DL-NEXT:    s_mov_b32 s2, 0xc020101
 ; GFX9-DL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-DL-NEXT:    global_load_dword v1, v0, s[12:13]
-; GFX9-DL-NEXT:    global_load_dword v2, v0, s[8:9]
+; GFX9-DL-NEXT:    global_load_dword v1, v0, s[8:9]
+; GFX9-DL-NEXT:    global_load_dword v2, v0, s[12:13]
 ; GFX9-DL-NEXT:    global_load_dword v3, v0, s[10:11]
 ; GFX9-DL-NEXT:    s_load_dword s3, s[14:15], 0x0
 ; GFX9-DL-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX9-DL-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-DL-NEXT:    v_perm_b32 v1, v1, v2, s0
+; GFX9-DL-NEXT:    v_perm_b32 v1, v2, v1, s0
 ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-DL-NEXT:    v_perm_b32 v2, v3, v3, s1
 ; GFX9-DL-NEXT:    v_or_b32_e32 v1, v1, v2
@@ -4476,12 +4476,12 @@ define amdgpu_kernel void @udot4_acc32_3src_3ele_src0(ptr addrspace(1) %src1,
 ; GFX10-DL-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-DL-NEXT:    s_clause 0x2
-; GFX10-DL-NEXT:    global_load_dword v1, v0, s[12:13]
-; GFX10-DL-NEXT:    global_load_dword v2, v0, s[8:9]
+; GFX10-DL-NEXT:    global_load_dword v1, v0, s[8:9]
+; GFX10-DL-NEXT:    global_load_dword v2, v0, s[12:13]
 ; GFX10-DL-NEXT:    global_load_dword v3, v0, s[10:11]
 ; GFX10-DL-NEXT:    s_load_dword s0, s[14:15], 0x0
 ; GFX10-DL-NEXT:    s_waitcnt vmcnt(1)
-; GFX10-DL-NEXT:    v_perm_b32 v0, v1, v2, 0xc06010c
+; GFX10-DL-NEXT:    v_perm_b32 v0, v2, v1, 0xc06010c
 ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-DL-NEXT:    v_perm_b32 v1, v3, v3, 0xc0c0c01
 ; GFX10-DL-NEXT:    v_mov_b32_e32 v2, 0
@@ -4500,12 +4500,12 @@ define amdgpu_kernel void @udot4_acc32_3src_3ele_src0(ptr addrspace(1) %src1,
 ; GFX11-DL-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX11-DL-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-DL-NEXT:    s_clause 0x2
-; GFX11-DL-NEXT:    global_load_b32 v1, v0, s[4:5]
-; GFX11-DL-NEXT:    global_load_b32 v2, v0, s[0:1]
+; GFX11-DL-NEXT:    global_load_b32 v1, v0, s[0:1]
+; GFX11-DL-NEXT:    global_load_b32 v2, v0, s[4:5]
 ; GFX11-DL-NEXT:    global_load_b32 v0, v0, s[2:3]
 ; GFX11-DL-NEXT:    s_load_b32 s0, s[6:7], 0x0
 ; GFX11-DL-NEXT:    s_waitcnt vmcnt(1)
-; GFX11-DL-NEXT:    v_perm_b32 v1, v1, v2, 0xc06010c
+; GFX11-DL-NEXT:    v_perm_b32 v1, v2, v1, 0xc06010c
 ; GFX11-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-DL-NEXT:    v_perm_b32 v2, v0, v0, 0xc0c0c01
 ; GFX11-DL-NEXT:    v_perm_b32 v0, v0, v0, 0xc020101
@@ -5919,10 +5919,10 @@ define amdgpu_kernel void @idot4_acc32_v16i8(ptr addrspace(1) %src1,
 ; GFX11-DL-NEXT:    global_load_b32 v0, v4, s[2:3]
 ; GFX11-DL-NEXT:    s_waitcnt vmcnt(1)
 ; GFX11-DL-NEXT:    v_perm_b32 v1, v3, v2, 0x7050002
-; GFX11-DL-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-DL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-DL-NEXT:    v_perm_b32 v0, v0, v0, 0x3020001
-; GFX11-DL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-DL-NEXT:    v_mov_b32_e32 v2, 0
+; GFX11-DL-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-DL-NEXT:    v_dot4_u32_u8 v0, v1, v0, 0
 ; GFX11-DL-NEXT:    global_store_b32 v2, v0, s[4:5]
 ; GFX11-DL-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll b/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll
index 6040cc47ad6f2..b5665835eaf7a 100644
--- a/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll
+++ b/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll
@@ -5763,13 +5763,13 @@ define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(ptr addrspace(1)
 ; GENERIC-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; GENERIC-NEXT:    v_cndmask_b32_e64 v17, 63, v17, s[0:1]
 ; GENERIC-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
-; GENERIC-NEXT:    v_mov_b32_e32 v15, s21
-; GENERIC-NEXT:    v_cmp_eq_u32_e32 vcc, 13, v14
-; GENERIC-NEXT:    v_cndmask_b32_e32 v15, v15, v1, vcc
-; GENERIC-NEXT:    v_cmp_ne_u32_e32 vcc, 13, v18
-; GENERIC-NEXT:    v_cndmask_b32_e32 v15, 63, v15, vcc
 ; GENERIC-NEXT:    v_mov_b32_e32 v19, s20
 ; GENERIC-NEXT:    v_cmp_eq_u32_e32 vcc, 12, v14
+; GENERIC-NEXT:    v_mov_b32_e32 v15, s21
+; GENERIC-NEXT:    v_cmp_eq_u32_e64 s[0:1], 13, v14
+; GENERIC-NEXT:    v_cndmask_b32_e64 v14, v15, v1, s[0:1]
+; GENERIC-NEXT:    v_cmp_ne_u32_e64 s[0:1], 13, v18
+; GENERIC-NEXT:    v_cndmask_b32_e64 v15, 63, v14, s[0:1]
 ; GENERIC-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x9
 ; GENERIC-NEXT:    s_mov_b32 s2, -1
 ; GENERIC-NEXT:    v_cndmask_b32_e32 v14, v19, v1, vcc
@@ -6319,19 +6319,19 @@ define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(ptr addrspace(1)
 ; SI-MOVREL-NEXT:    v_cmp_ne_u32_e32 vcc, 9, v18
 ; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v11, 63, v11, vcc
 ; SI-MOVREL-NEXT:    v_cmp_ne_u32_e32 vcc, 8, v18
+; SI-MOVREL-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v10, 63, v10, vcc
 ; SI-MOVREL-NEXT:    v_cmp_ne_u32_e32 vcc, 14, v18
-; SI-MOVREL-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; SI-MOVREL-NEXT:    v_cndmask_b32_e64 v17, 63, v17, s[0:1]
-; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
 ; SI-MOVREL-NEXT:    v_mov_b32_e32 v15, s21
-; SI-MOVREL-NEXT:    v_cmp_eq_u32_e32 vcc, 13, v14
+; SI-MOVREL-NEXT:    v_cmp_eq_u32_e64 s[0:1], 13, v14
+; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
+; SI-MOVREL-NEXT:    v_cmp_eq_u32_e32 vcc, 12, v14
+; SI-MOVREL-NEXT:    v_cndmask_b32_e64 v14, v15, v1, s[0:1]
+; SI-MOVREL-NEXT:    v_cmp_ne_u32_e64 s[0:1], 13, v18
+; SI-MOVREL-NEXT:    v_cndmask_b32_e64 v15, 63, v14, s[0:1]
 ; SI-MOVREL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x9
-; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v15, v15, v1, vcc
-; SI-MOVREL-NEXT:    v_cmp_ne_u32_e32 vcc, 13, v18
-; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v15, 63, v15, vcc
 ; SI-MOVREL-NEXT:    v_mov_b32_e32 v19, s20
-; SI-MOVREL-NEXT:    v_cmp_eq_u32_e32 vcc, 12, v14
 ; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v14, v19, v1, vcc
 ; SI-MOVREL-NEXT:    v_cmp_ne_u32_e32 vcc, 12, v18
 ; SI-MOVREL-NEXT:    v_cndmask_b32_e32 v14, 63, v14, vcc
@@ -6426,35 +6426,35 @@ define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(ptr addrspace(1)
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 15, v14
 ; VI-NEXT:    v_cndmask_b32_e32 v17, v11, v1, vcc
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 8, v14
-; VI-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; VI-NEXT:    v_mov_b32_e32 v16, s17
 ; VI-NEXT:    v_cndmask_b32_e32 v10, v15, v1, vcc
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 9, v14
-; VI-NEXT:    v_cndmask_b32_e64 v17, 63, v17, s[0:1]
-; VI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; VI-NEXT:    v_cndmask_b32_e32 v11, v16, v1, vcc
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 9, v18
 ; VI-NEXT:    v_cndmask_b32_e32 v11, 63, v11, vcc
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 8, v18
+; VI-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; VI-NEXT:    v_cndmask_b32_e32 v10, 63, v10, vcc
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 14, v18
-; VI-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
+; VI-NEXT:    v_cndmask_b32_e64 v17, 63, v17, s[0:1]
 ; VI-NEXT:    v_mov_b32_e32 v15, s21
-; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 13, v14
-; VI-NEXT:    v_cndmask_b32_e32 v15, v15, v1, vcc
-; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 13, v18
-; VI-NEXT:    s_waitcnt lgkmcnt(0)
-; VI-NEXT:    s_add_u32 s2, s0, 48
-; VI-NEXT:    v_cndmask_b32_e32 v15, 63, v15, vcc
-; VI-NEXT:    v_mov_b32_e32 v19, s20
+; VI-NEXT:    v_cmp_eq_u32_e64 s[0:1], 13, v14
+; VI-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 12, v14
-; VI-NEXT:    s_addc_u32 s3, s1, 0
+; VI-NEXT:    v_cndmask_b32_e64 v14, v15, v1, s[0:1]
+; VI-NEXT:    v_cmp_ne_u32_e64 s[0:1], 13, v18
+; VI-NEXT:    v_cndmask_b32_e64 v15, 63, v14, s[0:1]
+; VI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; VI-NEXT:    v_mov_b32_e32 v19, s20
 ; VI-NEXT:    v_cndmask_b32_e32 v14, v19, v1, vcc
 ; VI-NEXT:    v_cmp_ne_u32_e32 vcc, 12, v18
+; VI-NEXT:    v_cndmask_b32_e32 v14, 63, v14, vcc
+; VI-NEXT:    s_waitcnt lgkmcnt(0)
+; VI-NEXT:    s_add_u32 s2, s0, 48
+; VI-NEXT:    s_addc_u32 s3, s1, 0
 ; VI-NEXT:    v_mov_b32_e32 v19, s3
 ; VI-NEXT:    v_mov_b32_e32 v18, s2
 ; VI-NEXT:    s_add_u32 s2, s0, 32
-; VI-NEXT:    v_cndmask_b32_e32 v14, 63, v14, vcc
 ; VI-NEXT:    s_addc_u32 s3, s1, 0
 ; VI-NEXT:    flat_store_dwordx4 v[18:19], v[14:17]
 ; VI-NEXT:    s_waitcnt vmcnt(0)
@@ -6558,19 +6558,19 @@ define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(ptr addrspace(1)
 ; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e32 vcc, 9, v18
 ; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v11, 63, v11, vcc
 ; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e32 vcc, 8, v18
+; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v10, 63, v10, vcc
 ; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e32 vcc, 14, v18
-; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e64 s[0:1], 15, v18
 ; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e64 v17, 63, v17, s[0:1]
-; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
 ; GFX9-IDXMODE-NEXT:    v_mov_b32_e32 v15, s21
-; GFX9-IDXMODE-NEXT:    v_cmp_eq_u32_e32 vcc, 13, v14
+; GFX9-IDXMODE-NEXT:    v_cmp_eq_u32_e64 s[0:1], 13, v14
+; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v16, 63, v19, vcc
+; GFX9-IDXMODE-NEXT:    v_cmp_eq_u32_e32 vcc, 12, v14
+; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e64 v14, v15, v1, s[0:1]
+; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e64 s[0:1], 13, v18
+; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e64 v15, 63, v14, s[0:1]
 ; GFX9-IDXMODE-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v15, v15, v1, vcc
-; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e32 vcc, 13, v18
-; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v15, 63, v15, vcc
 ; GFX9-IDXMODE-NEXT:    v_mov_b32_e32 v19, s20
-; GFX9-IDXMODE-NEXT:    v_cmp_eq_u32_e32 vcc, 12, v14
 ; GFX9-IDXMODE-NEXT:    v_cndmask_b32_e32 v14, v19, v1, vcc
 ; GFX9-IDXMODE-NEXT:    v_cmp_ne_u32_e32 vcc, 12, v18
 ; GFX9-IDXMODE-NEXT:    v_mov_b32_e32 v18, 0
diff --git a/llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll b/llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll
index 837c18fe7af0a..be16fac4c53f7 100644
--- a/llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll
+++ b/llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll
@@ -733,23 +733,23 @@ define amdgpu_kernel void @dynamic_insertelement_v9f32(ptr addrspace(1) %out, <9
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_load_dwordx8 s[0:7], s[8:9], 0x40
 ; VI-NEXT:    s_load_dwordx2 s[12:13], s[8:9], 0x0
+; VI-NEXT:    s_load_dword s10, s[8:9], 0x60
 ; VI-NEXT:    v_mov_b32_e32 v9, 0x40a00000
 ; VI-NEXT:    s_mov_b32 s15, 0x1100f000
 ; VI-NEXT:    s_mov_b32 s14, -1
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s0
+; VI-NEXT:    s_load_dword s0, s[8:9], 0x80
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
-; VI-NEXT:    s_load_dword s0, s[8:9], 0x60
-; VI-NEXT:    s_load_dword s1, s[8:9], 0x80
 ; VI-NEXT:    v_mov_b32_e32 v2, s2
 ; VI-NEXT:    v_mov_b32_e32 v3, s3
 ; VI-NEXT:    v_mov_b32_e32 v4, s4
 ; VI-NEXT:    v_mov_b32_e32 v5, s5
 ; VI-NEXT:    v_mov_b32_e32 v6, s6
 ; VI-NEXT:    v_mov_b32_e32 v7, s7
+; VI-NEXT:    v_mov_b32_e32 v8, s10
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
-; VI-NEXT:    v_mov_b32_e32 v8, s0
-; VI-NEXT:    s_mov_b32 m0, s1
+; VI-NEXT:    s_mov_b32 m0, s0
 ; VI-NEXT:    v_movreld_b32_e32 v0, v9
 ; VI-NEXT:    buffer_store_dword v8, off, s[12:15], 0 offset:32
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[12:15], 0 offset:16
@@ -791,10 +791,11 @@ define amdgpu_kernel void @dynamic_insertelement_v10f32(ptr addrspace(1) %out, <
 ; VI-LABEL: dynamic_insertelement_v10f32:
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_load_dwordx8 s[12:19], s[8:9], 0x40
+; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x60
 ; VI-NEXT:    s_load_dword s6, s[8:9], 0x80
-; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    v_mov_b32_e32 v10, 0x40a00000
+; VI-NEXT:    s_mov_b32 s3, 0x1100f000
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s12
 ; VI-NEXT:    v_mov_b32_e32 v1, s13
@@ -807,7 +808,6 @@ define amdgpu_kernel void @dynamic_insertelement_v10f32(ptr addrspace(1) %out, <
 ; VI-NEXT:    v_mov_b32_e32 v8, s4
 ; VI-NEXT:    v_mov_b32_e32 v9, s5
 ; VI-NEXT:    s_mov_b32 m0, s6
-; VI-NEXT:    s_mov_b32 s3, 0x1100f000
 ; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_movreld_b32_e32 v0, v10
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[0:3], 0 offset:16
@@ -851,12 +851,13 @@ define amdgpu_kernel void @dynamic_insertelement_v11f32(ptr addrspace(1) %out, <
 ;
 ; VI-LABEL: dynamic_insertelement_v11f32:
 ; VI:       ; %bb.0:
+; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x60
 ; VI-NEXT:    s_load_dwordx8 s[12:19], s[8:9], 0x40
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_load_dword s7, s[8:9], 0x80
-; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    v_mov_b32_e32 v11, 0x40a00000
+; VI-NEXT:    s_mov_b32 s3, 0x1100f000
 ; VI-NEXT:    v_mov_b32_e32 v8, s4
 ; VI-NEXT:    v_mov_b32_e32 v0, s12
 ; VI-NEXT:    v_mov_b32_e32 v1, s13
@@ -870,7 +871,6 @@ define amdgpu_kernel void @dynamic_insertelement_v11f32(ptr addrspace(1) %out, <
 ; VI-NEXT:    v_mov_b32_e32 v10, s6
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_mov_b32 m0, s7
-; VI-NEXT:    s_mov_b32 s3, 0x1100f000
 ; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_movreld_b32_e32 v0, v11
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[0:3], 0 offset:16
@@ -919,6 +919,7 @@ define amdgpu_kernel void @dynamic_insertelement_v12f32(ptr addrspace(1) %out, <
 ; VI-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x60
 ; VI-NEXT:    s_load_dword s8, s[8:9], 0x80
 ; VI-NEXT:    v_mov_b32_e32 v12, 0x40a00000
+; VI-NEXT:    s_mov_b32 s3, 0x1100f000
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s12
 ; VI-NEXT:    v_mov_b32_e32 v1, s13
@@ -933,7 +934,6 @@ define amdgpu_kernel void @dynamic_insertelement_v12f32(ptr addrspace(1) %out, <
 ; VI-NEXT:    v_mov_b32_e32 v10, s6
 ; VI-NEXT:    v_mov_b32_e32 v11, s7
 ; VI-NEXT:    s_mov_b32 m0, s8
-; VI-NEXT:    s_mov_b32 s3, 0x1100f000
 ; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_movreld_b32_e32 v0, v12
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[0:3], 0 offset:16
@@ -1101,14 +1101,14 @@ define amdgpu_kernel void @dynamic_insertelement_v3i32(ptr addrspace(1) %out, <3
 define amdgpu_kernel void @dynamic_insertelement_v4i32(ptr addrspace(1) %out, <4 x i32> %a, i32 %b, [8 x i32], i32 %val) nounwind {
 ; SI-LABEL: dynamic_insertelement_v4i32:
 ; SI:       ; %bb.0:
-; SI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x4
 ; SI-NEXT:    s_load_dword s10, s[8:9], 0x8
 ; SI-NEXT:    s_load_dword s11, s[8:9], 0x11
+; SI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x4
 ; SI-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x0
 ; SI-NEXT:    s_mov_b32 s7, 0x100f000
-; SI-NEXT:    s_mov_b32 s6, -1
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_cmp_eq_u32 s10, 3
+; SI-NEXT:    s_mov_b32 s6, -1
 ; SI-NEXT:    s_cselect_b32 s3, s11, s3
 ; SI-NEXT:    s_cmp_eq_u32 s10, 2
 ; SI-NEXT:    s_cselect_b32 s2, s11, s2
@@ -1125,14 +1125,14 @@ define amdgpu_kernel void @dynamic_insertelement_v4i32(ptr addrspace(1) %out, <4
 ;
 ; VI-LABEL: dynamic_insertelement_v4i32:
 ; VI:       ; %bb.0:
-; VI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x10
 ; VI-NEXT:    s_load_dword s10, s[8:9], 0x20
 ; VI-NEXT:    s_load_dword s11, s[8:9], 0x44
+; VI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x10
 ; VI-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x0
 ; VI-NEXT:    s_mov_b32 s7, 0x1100f000
-; VI-NEXT:    s_mov_b32 s6, -1
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_cmp_eq_u32 s10, 3
+; VI-NEXT:    s_mov_b32 s6, -1
 ; VI-NEXT:    s_cselect_b32 s3, s11, s3
 ; VI-NEXT:    s_cmp_eq_u32 s10, 2
 ; VI-NEXT:    s_cselect_b32 s2, s11, s2
@@ -1286,10 +1286,11 @@ define amdgpu_kernel void @dynamic_insertelement_v10i32(ptr addrspace(1) %out, <
 ; VI-LABEL: dynamic_insertelement_v10i32:
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_load_dwordx8 s[12:19], s[8:9], 0x40
+; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x60
 ; VI-NEXT:    s_load_dword s6, s[8:9], 0x80
-; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    s_mov_b32 s3, 0x1100f000
+; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s12
 ; VI-NEXT:    v_mov_b32_e32 v1, s13
@@ -1302,7 +1303,6 @@ define amdgpu_kernel void @dynamic_insertelement_v10i32(ptr addrspace(1) %out, <
 ; VI-NEXT:    v_mov_b32_e32 v8, s4
 ; VI-NEXT:    v_mov_b32_e32 v9, s5
 ; VI-NEXT:    s_mov_b32 m0, s6
-; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_movreld_b32_e32 v0, 5
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[0:3], 0 offset:16
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[0:3], 0
@@ -1344,12 +1344,13 @@ define amdgpu_kernel void @dynamic_insertelement_v11i32(ptr addrspace(1) %out, <
 ;
 ; VI-LABEL: dynamic_insertelement_v11i32:
 ; VI:       ; %bb.0:
+; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x60
 ; VI-NEXT:    s_load_dwordx8 s[12:19], s[8:9], 0x40
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_load_dword s7, s[8:9], 0x80
-; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; VI-NEXT:    s_mov_b32 s3, 0x1100f000
+; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_mov_b32_e32 v8, s4
 ; VI-NEXT:    v_mov_b32_e32 v0, s12
 ; VI-NEXT:    v_mov_b32_e32 v1, s13
@@ -1363,7 +1364,6 @@ define amdgpu_kernel void @dynamic_insertelement_v11i32(ptr addrspace(1) %out, <
 ; VI-NEXT:    v_mov_b32_e32 v10, s6
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_mov_b32 m0, s7
-; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_movreld_b32_e32 v0, 5
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[0:3], 0 offset:16
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[0:3], 0
@@ -1410,6 +1410,7 @@ define amdgpu_kernel void @dynamic_insertelement_v12i32(ptr addrspace(1) %out, <
 ; VI-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x60
 ; VI-NEXT:    s_load_dword s8, s[8:9], 0x80
 ; VI-NEXT:    s_mov_b32 s3, 0x1100f000
+; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s12
 ; VI-NEXT:    v_mov_b32_e32 v1, s13
@@ -1424,7 +1425,6 @@ define amdgpu_kernel void @dynamic_insertelement_v12i32(ptr addrspace(1) %out, <
 ; VI-NEXT:    v_mov_b32_e32 v10, s6
 ; VI-NEXT:    v_mov_b32_e32 v11, s7
 ; VI-NEXT:    s_mov_b32 m0, s8
-; VI-NEXT:    s_mov_b32 s2, -1
 ; VI-NEXT:    v_movreld_b32_e32 v0, 5
 ; VI-NEXT:    buffer_store_dwordx4 v[4:7], off, s[0:3], 0 offset:16
 ; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[0:3], 0
diff --git a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
index 81ef7351b84e9..678d06e969276 100644
--- a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
@@ -6385,16 +6385,16 @@ define <2 x i64> @clpeak_imad_pat_v2i64(<2 x i64> %x, <2 x i64> %y) {
 ; GFX1200-GISEL-NEXT:    v_add_co_ci_u32_e64 v2, null, 0, v2, vcc_lo
 ; GFX1200-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX1200-GISEL-NEXT:    v_mul_hi_u32 v1, v10, v9
-; GFX1200-GISEL-NEXT:    v_mul_lo_u32 v14, v10, v9
+; GFX1200-GISEL-NEXT:    v_mul_lo_u32 v15, v10, v9
 ; GFX1200-GISEL-NEXT:    v_add_co_u32 v12, vcc_lo, v7, 1
 ; GFX1200-GISEL-NEXT:    s_wait_alu 0xfffd
 ; GFX1200-GISEL-NEXT:    v_add_co_ci_u32_e64 v13, null, 0, v3, vcc_lo
-; GFX1200-GISEL-NEXT:    v_add_co_u32 v15, vcc_lo, v10, 1
+; GFX1200-GISEL-NEXT:    v_add_co_u32 v14, vcc_lo, v10, 1
 ; GFX1200-GISEL-NEXT:    v_mul_lo_u32 v11, v7, v8
 ; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[5:6], null, v7, v5, v[0:1]
 ; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[1:2], null, v10, v2, v[1:2]
 ; GFX1200-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX1200-GISEL-NEXT:    v_mul_hi_u32 v2, v14, v15
+; GFX1200-GISEL-NEXT:    v_mul_hi_u32 v2, v15, v14
 ; GFX1200-GISEL-NEXT:    s_wait_alu 0xfffd
 ; GFX1200-GISEL-NEXT:    v_add_co_ci_u32_e64 v10, null, 0, v4, vcc_lo
 ; GFX1200-GISEL-NEXT:    v_mul_hi_u32 v0, v11, v12
@@ -6403,11 +6403,11 @@ define <2 x i64> @clpeak_imad_pat_v2i64(<2 x i64> %x, <2 x i64> %y) {
 ; GFX1200-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
 ; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[6:7], null, v11, v13, v[0:1]
 ; GFX1200-GISEL-NEXT:    v_mul_lo_u32 v0, v11, v12
-; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[7:8], null, v14, v10, v[2:3]
+; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[7:8], null, v15, v10, v[2:3]
 ; GFX1200-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)
 ; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[1:2], null, v5, v12, v[6:7]
-; GFX1200-GISEL-NEXT:    v_mul_lo_u32 v2, v14, v15
-; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[3:4], null, v3, v15, v[7:8]
+; GFX1200-GISEL-NEXT:    v_mul_lo_u32 v2, v15, v14
+; GFX1200-GISEL-NEXT:    v_mad_co_u64_u32 v[3:4], null, v3, v14, v[7:8]
 ; GFX1200-GISEL-NEXT:    s_setpc_b64 s[30:31]
 entry:
   %y18 = add <2 x i64> %x, <i64 1, i64 1>
diff --git a/llvm/test/CodeGen/AMDGPU/kernel-args.ll b/llvm/test/CodeGen/AMDGPU/kernel-args.ll
index 9df995b5a7066..a18b5b5396f63 100644
--- a/llvm/test/CodeGen/AMDGPU/kernel-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/kernel-args.ll
@@ -1664,8 +1664,8 @@ entry:
 define amdgpu_kernel void @v5i16_arg(ptr addrspace(1) nocapture %out, <5 x i16> %in) nounwind {
 ; SI-LABEL: v5i16_arg:
 ; SI:       ; %bb.0: ; %entry
-; SI-NEXT:    s_load_dword s6, s[4:5], 0xf
 ; SI-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x9
+; SI-NEXT:    s_load_dword s6, s[4:5], 0xf
 ; SI-NEXT:    s_load_dwordx2 s[4:5], s[4:5], 0xd
 ; SI-NEXT:    s_mov_b32 s3, 0xf000
 ; SI-NEXT:    s_mov_b32 s2, -1
@@ -5191,16 +5191,16 @@ define amdgpu_kernel void @array_3xi16(i8 %arg0, [3 x i16] %arg1) {
 ; VI-NEXT:    s_addc_u32 s1, s5, 0
 ; VI-NEXT:    s_add_u32 s2, s0, 2
 ; VI-NEXT:    s_addc_u32 s3, s1, 0
-; VI-NEXT:    v_mov_b32_e32 v0, s0
-; VI-NEXT:    v_mov_b32_e32 v1, s1
-; VI-NEXT:    s_add_u32 s0, s4, 42
-; VI-NEXT:    s_addc_u32 s1, s5, 0
 ; VI-NEXT:    v_mov_b32_e32 v3, s1
 ; VI-NEXT:    v_mov_b32_e32 v2, s0
-; VI-NEXT:    flat_load_ushort v4, v[0:1]
-; VI-NEXT:    flat_load_ushort v2, v[2:3]
+; VI-NEXT:    s_add_u32 s0, s4, 42
+; VI-NEXT:    s_addc_u32 s1, s5, 0
+; VI-NEXT:    v_mov_b32_e32 v5, s1
 ; VI-NEXT:    v_mov_b32_e32 v0, s2
+; VI-NEXT:    v_mov_b32_e32 v4, s0
 ; VI-NEXT:    v_mov_b32_e32 v1, s3
+; VI-NEXT:    flat_load_ushort v4, v[4:5]
+; VI-NEXT:    flat_load_ushort v2, v[2:3]
 ; VI-NEXT:    flat_load_ushort v0, v[0:1]
 ; VI-NEXT:    s_load_dword s0, s[4:5], 0x24
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
@@ -5208,10 +5208,10 @@ define amdgpu_kernel void @array_3xi16(i8 %arg0, [3 x i16] %arg1) {
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    flat_store_byte v[0:1], v1
 ; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    flat_store_short v[0:1], v2
-; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    flat_store_short v[0:1], v4
 ; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    flat_store_short v[0:1], v2
+; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    flat_store_short v[0:1], v0
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll
index 2afa9ba14ceae..968c198fb6239 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll
@@ -35,7 +35,7 @@ define amdgpu_kernel void @global_atomic_ordered_add_b64_rtn(ptr addrspace(1) %a
 ; GFX12-SDAG-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX12-SDAG-NEXT:    s_load_b64 s[4:5], s[4:5], 0x34
 ; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
-; GFX12-SDAG-NEXT:    v_dual_mov_b32 v1, s3 :: v_dual_mov_b32 v0, s2
+; GFX12-SDAG-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, s3
 ; GFX12-SDAG-NEXT:    global_atomic_ordered_add_b64 v[0:1], v2, v[0:1], s[0:1] offset:32 th:TH_ATOMIC_RETURN
 ; GFX12-SDAG-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-SDAG-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll
index 9606c68684957..6a5c83248038d 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll
@@ -477,8 +477,8 @@ define amdgpu_kernel void @image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr,
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v3, null, s3, 0, s0
 ; GFX11-NEXT:    flat_load_b32 v9, v[0:1]
 ; GFX11-NEXT:    flat_load_b32 v10, v[2:3]
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0x40e00000
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0x40c00000
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0x40e00000
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0x41000000
 ; GFX11-NEXT:    v_mov_b32_e32 v3, 0x40400000
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -505,8 +505,8 @@ define amdgpu_kernel void @image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr,
 ; GFX12-SDAG-NEXT:    v_add_co_ci_u32_e64 v3, null, s3, 0, s0
 ; GFX12-SDAG-NEXT:    flat_load_b32 v9, v[0:1]
 ; GFX12-SDAG-NEXT:    flat_load_b32 v10, v[2:3]
-; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, 0x40e00000
 ; GFX12-SDAG-NEXT:    v_mov_b32_e32 v0, 0x40c00000
+; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, 0x40e00000
 ; GFX12-SDAG-NEXT:    v_mov_b32_e32 v2, 0x41000000
 ; GFX12-SDAG-NEXT:    v_mov_b32_e32 v3, 0x40400000
 ; GFX12-SDAG-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -633,8 +633,8 @@ define amdgpu_kernel void @image_bvh_intersect_ray_a16_nsa_reassign(ptr %p_node_
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v3, null, s3, 0, s0
 ; GFX11-NEXT:    flat_load_b32 v6, v[0:1]
 ; GFX11-NEXT:    flat_load_b32 v7, v[2:3]
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0x46004200
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX11-NEXT:    v_dual_mov_b32 v2, 0x48004500 :: v_dual_mov_b32 v3, 0
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    image_bvh_intersect_ray v[0:3], [v6, v7, v[3:5], v[0:2]], s[4:7] a16
@@ -658,8 +658,8 @@ define amdgpu_kernel void @image_bvh_intersect_ray_a16_nsa_reassign(ptr %p_node_
 ; GFX12-SDAG-NEXT:    v_add_co_ci_u32_e64 v3, null, s3, 0, s0
 ; GFX12-SDAG-NEXT:    flat_load_b32 v6, v[0:1]
 ; GFX12-SDAG-NEXT:    flat_load_b32 v7, v[2:3]
-; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX12-SDAG-NEXT:    v_mov_b32_e32 v0, 0x46004200
+; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX12-SDAG-NEXT:    v_dual_mov_b32 v2, 0x48004500 :: v_dual_mov_b32 v3, 0
 ; GFX12-SDAG-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX12-SDAG-NEXT:    image_bvh_intersect_ray v[0:3], [v6, v7, v[3:5], v[0:2]], s[4:7] a16
@@ -947,8 +947,8 @@ define amdgpu_kernel void @image_bvh64_intersect_ray_a16_nsa_reassign(ptr %p_ray
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v1, null, s7, 0, s4
 ; GFX11-NEXT:    flat_load_b32 v8, v[0:1]
-; GFX11-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0x46004200
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    image_bvh64_intersect_ray v[0:3], [v[6:7], v8, v[3:5], v[0:2]], s[0:3] a16
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
@@ -973,8 +973,8 @@ define amdgpu_kernel void @image_bvh64_intersect_ray_a16_nsa_reassign(ptr %p_ray
 ; GFX12-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX12-SDAG-NEXT:    v_add_co_ci_u32_e64 v1, null, s7, 0, s4
 ; GFX12-SDAG-NEXT:    flat_load_b32 v8, v[0:1]
-; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX12-SDAG-NEXT:    v_mov_b32_e32 v0, 0x46004200
+; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, 0x47004400
 ; GFX12-SDAG-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX12-SDAG-NEXT:    image_bvh64_intersect_ray v[0:3], [v[6:7], v8, v[3:5], v[0:2]], s[0:3] a16
 ; GFX12-SDAG-NEXT:    s_wait_bvhcnt 0x0
@@ -995,12 +995,12 @@ define amdgpu_kernel void @image_bvh64_intersect_ray_a16_nsa_reassign(ptr %p_ray
 ; GFX12-GISEL-NEXT:    s_mov_b32 s10, 0x45004800
 ; GFX12-GISEL-NEXT:    v_mov_b32_e32 v6, 0xb36211c6
 ; GFX12-GISEL-NEXT:    v_bfrev_b32_e32 v7, 4.0
-; GFX12-GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX12-GISEL-NEXT:    v_dual_mov_b32 v5, s10 :: v_dual_mov_b32 v4, s9
+; GFX12-GISEL-NEXT:    v_dual_mov_b32 v3, s8 :: v_dual_mov_b32 v4, s9
 ; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12-GISEL-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s7
+; GFX12-GISEL-NEXT:    v_dual_mov_b32 v5, s10 :: v_dual_mov_b32 v0, s6
+; GFX12-GISEL-NEXT:    v_mov_b32_e32 v1, s7
 ; GFX12-GISEL-NEXT:    s_mov_b32 s6, 2.0
-; GFX12-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX12-GISEL-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
 ; GFX12-GISEL-NEXT:    v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
 ; GFX12-GISEL-NEXT:    flat_load_b32 v8, v[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll
index fb755ea2e5a7f..24e213ea2fe55 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll
@@ -183,8 +183,8 @@ entry:
 define amdgpu_cs void @sgpr_inverse_ballot(i64 inreg %input, ptr addrspace(1) %out) {
 ; GISEL_W64-LABEL: sgpr_inverse_ballot:
 ; GISEL_W64:       ; %bb.0: ; %entry
-; GISEL_W64-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s[0:1]
 ; GISEL_W64-NEXT:    v_mov_b32_e32 v3, 0
+; GISEL_W64-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s[0:1]
 ; GISEL_W64-NEXT:    global_store_b64 v[0:1], v[2:3], off
 ; GISEL_W64-NEXT:    s_endpgm
 ;
@@ -199,8 +199,8 @@ define amdgpu_cs void @sgpr_inverse_ballot(i64 inreg %input, ptr addrspace(1) %o
 ;
 ; GISEL_W32-LABEL: sgpr_inverse_ballot:
 ; GISEL_W32:       ; %bb.0: ; %entry
-; GISEL_W32-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s0
 ; GISEL_W32-NEXT:    v_mov_b32_e32 v3, 0
+; GISEL_W32-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s0
 ; GISEL_W32-NEXT:    global_store_b64 v[0:1], v[2:3], off
 ; GISEL_W32-NEXT:    s_endpgm
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll
index f7c37caf41eab..393d8c1a1bf2f 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll
@@ -69,14 +69,13 @@ define amdgpu_kernel void @test_v3p3(ptr addrspace(1) %out, <3 x ptr addrspace(3
 ; GFX11-SDAG-NEXT:    s_clause 0x1
 ; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x34
 ; GFX11-SDAG-NEXT:    s_load_b64 s[4:5], s[4:5], 0x24
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v3, s0
 ; GFX11-SDAG-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, s1
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v3, s0
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v2, v0
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v1, v1
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v0, v3
 ; GFX11-SDAG-NEXT:    global_store_b96 v4, v[0:2], s[4:5]
 ; GFX11-SDAG-NEXT:    s_endpgm
@@ -108,14 +107,13 @@ define amdgpu_kernel void @test_v3p5(ptr addrspace(1) %out, <3 x ptr addrspace(5
 ; GFX11-SDAG-NEXT:    s_clause 0x1
 ; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x34
 ; GFX11-SDAG-NEXT:    s_load_b64 s[4:5], s[4:5], 0x24
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v3, s0
 ; GFX11-SDAG-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, s1
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v3, s0
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v2, v0
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v1, v1
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v0, v3
 ; GFX11-SDAG-NEXT:    global_store_b96 v4, v[0:2], s[4:5]
 ; GFX11-SDAG-NEXT:    s_endpgm
@@ -147,14 +145,13 @@ define amdgpu_kernel void @test_v3p6(ptr addrspace(1) %out, <3 x ptr addrspace(6
 ; GFX11-SDAG-NEXT:    s_clause 0x1
 ; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x34
 ; GFX11-SDAG-NEXT:    s_load_b64 s[4:5], s[4:5], 0x24
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v3, s0
 ; GFX11-SDAG-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, s1
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v3, s0
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v2, v0
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v1, v1
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_4)
 ; GFX11-SDAG-NEXT:    v_permlane64_b32 v0, v3
 ; GFX11-SDAG-NEXT:    global_store_b96 v4, v[0:2], s[4:5]
 ; GFX11-SDAG-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll
index c0afc0a443955..49a334b8b6c52 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll
@@ -984,9 +984,9 @@ define void @test_readfirstlane_v32f32(ptr addrspace(1) %out, <32 x float> %src)
 ; CHECK-SDAG-NEXT:    s_xor_saveexec_b64 s[4:5], -1
 ; CHECK-SDAG-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
 ; CHECK-SDAG-NEXT:    s_mov_b64 exec, s[4:5]
-; CHECK-SDAG-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4
-; CHECK-SDAG-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s61, v27
+; CHECK-SDAG-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:8
+; CHECK-SDAG-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
 ; CHECK-SDAG-NEXT:    buffer_load_dword v27, off, s[0:3], s32
 ; CHECK-SDAG-NEXT:    v_writelane_b32 v31, s36, 0
 ; CHECK-SDAG-NEXT:    v_writelane_b32 v31, s37, 1
@@ -1033,9 +1033,9 @@ define void @test_readfirstlane_v32f32(ptr addrspace(1) %out, <32 x float> %src)
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s41, v7
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s40, v6
 ; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(2)
-; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s66, v0
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s67, v0
 ; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(1)
-; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s67, v1
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s66, v1
 ; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s65, v27
 ; CHECK-SDAG-NEXT:    ;;#ASMSTART
@@ -1429,9 +1429,9 @@ define void @test_readfirstlane_v32i32(ptr addrspace(1) %out, <32 x i32> %src) {
 ; CHECK-SDAG-NEXT:    s_xor_saveexec_b64 s[4:5], -1
 ; CHECK-SDAG-NEXT:    buffer_store_dword v31, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill
 ; CHECK-SDAG-NEXT:    s_mov_b64 exec, s[4:5]
-; CHECK-SDAG-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4
-; CHECK-SDAG-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s61, v27
+; CHECK-SDAG-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:8
+; CHECK-SDAG-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:4
 ; CHECK-SDAG-NEXT:    buffer_load_dword v27, off, s[0:3], s32
 ; CHECK-SDAG-NEXT:    v_writelane_b32 v31, s36, 0
 ; CHECK-SDAG-NEXT:    v_writelane_b32 v31, s37, 1
@@ -1478,9 +1478,9 @@ define void @test_readfirstlane_v32i32(ptr addrspace(1) %out, <32 x i32> %src) {
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s41, v7
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s40, v6
 ; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(2)
-; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s66, v0
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s67, v0
 ; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(1)
-; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s67, v1
+; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s66, v1
 ; CHECK-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-SDAG-NEXT:    v_readfirstlane_b32 s65, v27
 ; CHECK-SDAG-NEXT:    ;;#ASMSTART
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll
index 8cf7497fca640..b6656569b79c1 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll
@@ -3283,8 +3283,8 @@ define void @test_writelane_v8f64(ptr addrspace(1) %out, <8 x double> %src, i32
 ; GFX802-SDAG-NEXT:    v_readfirstlane_b32 s7, v3
 ; GFX802-SDAG-NEXT:    v_readfirstlane_b32 s8, v2
 ; GFX802-SDAG-NEXT:    v_addc_u32_e32 v23, vcc, 0, v1, vcc
-; GFX802-SDAG-NEXT:    flat_load_dwordx4 v[2:5], v[22:23]
 ; GFX802-SDAG-NEXT:    s_mov_b32 m0, s4
+; GFX802-SDAG-NEXT:    flat_load_dwordx4 v[2:5], v[22:23]
 ; GFX802-SDAG-NEXT:    v_readfirstlane_b32 s4, v9
 ; GFX802-SDAG-NEXT:    v_readfirstlane_b32 s10, v15
 ; GFX802-SDAG-NEXT:    v_readfirstlane_b32 s11, v14
@@ -3444,8 +3444,8 @@ define void @test_writelane_v8f64(ptr addrspace(1) %out, <8 x double> %src, i32
 ; GFX802-GISEL-NEXT:    v_readfirstlane_b32 s7, v4
 ; GFX802-GISEL-NEXT:    v_readfirstlane_b32 s8, v5
 ; GFX802-GISEL-NEXT:    v_addc_u32_e32 v23, vcc, 0, v1, vcc
-; GFX802-GISEL-NEXT:    flat_load_dwordx4 v[2:5], v[22:23]
 ; GFX802-GISEL-NEXT:    s_mov_b32 m0, s5
+; GFX802-GISEL-NEXT:    flat_load_dwordx4 v[2:5], v[22:23]
 ; GFX802-GISEL-NEXT:    v_readfirstlane_b32 s5, v7
 ; GFX802-GISEL-NEXT:    v_readfirstlane_b32 s9, v11
 ; GFX802-GISEL-NEXT:    v_readfirstlane_b32 s10, v12
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.log.ll b/llvm/test/CodeGen/AMDGPU/llvm.log.ll
index 1dd6a7926029e..ca4571dcb3719 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.log.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.log.ll
@@ -1945,10 +1945,11 @@ define amdgpu_kernel void @s_log_v4f32(ptr addrspace(1) %out, <4 x float> %in) {
 ; GFX1100-GISEL-NEXT:    v_dual_fmac_f32 v12, 0x3377d1cf, v2 :: v_dual_fmac_f32 v13, 0x3377d1cf, v3
 ; GFX1100-GISEL-NEXT:    v_add_f32_e32 v7, v7, v12
 ; GFX1100-GISEL-NEXT:    s_waitcnt_depctr 0xfff
-; GFX1100-GISEL-NEXT:    v_dual_mul_f32 v5, 0x3f317217, v0 :: v_dual_add_f32 v8, v8, v13
-; GFX1100-GISEL-NEXT:    v_mul_f32_e32 v6, 0x3f317217, v1
+; GFX1100-GISEL-NEXT:    v_mul_f32_e32 v5, 0x3f317217, v0
 ; GFX1100-GISEL-NEXT:    v_cmp_gt_f32_e64 vcc_lo, 0x7f800000, |v0|
-; GFX1100-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX1100-GISEL-NEXT:    v_mul_f32_e32 v6, 0x3f317217, v1
+; GFX1100-GISEL-NEXT:    v_add_f32_e32 v8, v8, v13
+; GFX1100-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX1100-GISEL-NEXT:    v_fma_f32 v10, 0x3f317217, v0, -v5
 ; GFX1100-GISEL-NEXT:    v_fma_f32 v11, 0x3f317217, v1, -v6
 ; GFX1100-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.log10.ll b/llvm/test/CodeGen/AMDGPU/llvm.log10.ll
index 86a58d26c6ae5..904945860c3eb 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.log10.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.log10.ll
@@ -1945,10 +1945,11 @@ define amdgpu_kernel void @s_log10_v4f32(ptr addrspace(1) %out, <4 x float> %in)
 ; GFX1100-GISEL-NEXT:    v_dual_fmac_f32 v12, 0x3284fbcf, v2 :: v_dual_fmac_f32 v13, 0x3284fbcf, v3
 ; GFX1100-GISEL-NEXT:    v_add_f32_e32 v7, v7, v12
 ; GFX1100-GISEL-NEXT:    s_waitcnt_depctr 0xfff
-; GFX1100-GISEL-NEXT:    v_dual_mul_f32 v5, 0x3e9a209a, v0 :: v_dual_add_f32 v8, v8, v13
-; GFX1100-GISEL-NEXT:    v_mul_f32_e32 v6, 0x3e9a209a, v1
+; GFX1100-GISEL-NEXT:    v_mul_f32_e32 v5, 0x3e9a209a, v0
 ; GFX1100-GISEL-NEXT:    v_cmp_gt_f32_e64 vcc_lo, 0x7f800000, |v0|
-; GFX1100-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX1100-GISEL-NEXT:    v_mul_f32_e32 v6, 0x3e9a209a, v1
+; GFX1100-GISEL-NEXT:    v_add_f32_e32 v8, v8, v13
+; GFX1100-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX1100-GISEL-NEXT:    v_fma_f32 v10, 0x3e9a209a, v0, -v5
 ; GFX1100-GISEL-NEXT:    v_fma_f32 v11, 0x3e9a209a, v1, -v6
 ; GFX1100-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll b/llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll
index 362b9dacaf257..ef35d42f299f8 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll
@@ -2313,121 +2313,121 @@ define <16 x half> @v_maximum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX7-LABEL: v_maximum_v16f16:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v17
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v0, v0
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v2, v2
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v3, v3
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v0, v0
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX7-NEXT:    v_max_f32_e32 v0, v0, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v17
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v3, v3
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
-; GFX7-NEXT:    v_max_f32_e32 v1, v1, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v18
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v4, v4
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v6, v6
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v7, v7
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v4, v4
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v5, v5
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v16
+; GFX7-NEXT:    v_max_f32_e32 v1, v1, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v18
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v17
-; GFX7-NEXT:    v_max_f32_e32 v2, v2, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v19
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v7, v7
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v9, v9
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v10, v10
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v7, v7
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v8, v8
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v16
+; GFX7-NEXT:    v_max_f32_e32 v2, v2, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v19
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v9, v9
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v17
-; GFX7-NEXT:    v_max_f32_e32 v3, v3, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v20
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v10, v10
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v10, v10
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v27
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v11, v11
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v18, v28
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v10, v10
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v16
+; GFX7-NEXT:    v_max_f32_e32 v3, v3, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v20
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v11, v11
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v17
-; GFX7-NEXT:    v_max_f32_e32 v4, v4, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v21
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v12, v12
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v13, v13
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v19, v16
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v18
-; GFX7-NEXT:    v_max_f32_e32 v12, v12, v18
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v18, v29
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v17
-; GFX7-NEXT:    v_max_f32_e32 v5, v5, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v22
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v20, v0
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v18
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v18, v13
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v0, v19
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v13, v20
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[26:27], v18, v16
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v17
-; GFX7-NEXT:    v_max_f32_e32 v6, v6, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v23
-; GFX7-NEXT:    v_max_f32_e32 v16, v18, v16
-; GFX7-NEXT:    v_max_f32_e32 v18, v13, v0
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v0
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v13, v15
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v15, v30
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v12, v12
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v19, v29
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v17
+; GFX7-NEXT:    v_max_f32_e32 v11, v11, v17
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v28
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v16
+; GFX7-NEXT:    v_max_f32_e32 v4, v4, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v21
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v20, v13
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v13, v17
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v18, v12
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v12, v19
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v20
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[26:27], v18, v13
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v16
+; GFX7-NEXT:    v_max_f32_e32 v5, v5, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v22
+; GFX7-NEXT:    v_max_f32_e32 v13, v18, v13
+; GFX7-NEXT:    v_max_f32_e32 v18, v17, v12
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[28:29], v17, v12
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v14, v14
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v17
-; GFX7-NEXT:    v_max_f32_e32 v7, v7, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v24
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v15, v15
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v14, v14
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v20, v13
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v15, v15
 ; GFX7-NEXT:    v_mov_b32_e32 v19, 0x7fc00000
-; GFX7-NEXT:    v_cndmask_b32_e32 v1, v19, v1, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v13, v19, v16, s[26:27]
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v17
-; GFX7-NEXT:    v_max_f32_e32 v8, v8, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v25
-; GFX7-NEXT:    v_max_f32_e32 v16, v14, v15
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v14, v15
-; GFX7-NEXT:    v_cndmask_b32_e32 v14, v19, v16, vcc
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cndmask_b32_e64 v2, v19, v2, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v3, v19, v3, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v4, v19, v4, s[8:9]
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v17
-; GFX7-NEXT:    v_max_f32_e32 v9, v9, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v26
-; GFX7-NEXT:    v_cndmask_b32_e64 v5, v19, v5, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v6, v19, v6, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v7, v19, v7, s[14:15]
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cndmask_b32_e64 v8, v19, v8, s[16:17]
-; GFX7-NEXT:    v_cndmask_b32_e64 v9, v19, v9, s[18:19]
-; GFX7-NEXT:    v_cndmask_b32_e64 v12, v19, v12, s[24:25]
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v17
-; GFX7-NEXT:    v_max_f32_e32 v10, v10, v17
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v17, v27
-; GFX7-NEXT:    v_cndmask_b32_e64 v10, v19, v10, s[20:21]
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v17
-; GFX7-NEXT:    v_max_f32_e32 v11, v11, v17
-; GFX7-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX7-NEXT:    v_cndmask_b32_e64 v11, v19, v11, s[22:23]
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v16
+; GFX7-NEXT:    v_max_f32_e32 v6, v6, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v23
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v14, v14
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v15, v15
+; GFX7-NEXT:    v_cndmask_b32_e32 v0, v19, v0, vcc
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cndmask_b32_e64 v1, v19, v1, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v2, v19, v2, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v3, v19, v3, s[8:9]
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v16
+; GFX7-NEXT:    v_max_f32_e32 v7, v7, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v24
+; GFX7-NEXT:    v_cndmask_b32_e64 v4, v19, v4, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v5, v19, v5, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, v19, v6, s[14:15]
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cndmask_b32_e64 v7, v19, v7, s[16:17]
+; GFX7-NEXT:    v_cndmask_b32_e64 v11, v19, v11, s[24:25]
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v16
+; GFX7-NEXT:    v_max_f32_e32 v8, v8, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v25
+; GFX7-NEXT:    v_cndmask_b32_e64 v8, v19, v8, s[18:19]
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v16
+; GFX7-NEXT:    v_max_f32_e32 v9, v9, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v26
+; GFX7-NEXT:    v_cndmask_b32_e64 v9, v19, v9, s[20:21]
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v16
+; GFX7-NEXT:    v_max_f32_e32 v10, v10, v16
+; GFX7-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX7-NEXT:    v_cndmask_b32_e64 v10, v19, v10, s[22:23]
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_cvt_f16_f32_e32 v0, v17
-; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v0
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v19, v18, s[28:29]
-; GFX7-NEXT:    v_max_f32_e32 v15, v20, v17
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v20, v17
-; GFX7-NEXT:    v_cndmask_b32_e32 v15, v19, v15, vcc
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v12, v16
+; GFX7-NEXT:    v_cvt_f16_f32_e32 v16, v30
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v17, v12
+; GFX7-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; GFX7-NEXT:    v_cndmask_b32_e64 v12, v19, v13, s[26:27]
+; GFX7-NEXT:    v_cndmask_b32_e64 v13, v19, v18, s[28:29]
+; GFX7-NEXT:    v_max_f32_e32 v18, v14, v16
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v14, v16
+; GFX7-NEXT:    v_cndmask_b32_e32 v14, v19, v18, vcc
+; GFX7-NEXT:    v_max_f32_e32 v16, v15, v17
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
+; GFX7-NEXT:    v_cndmask_b32_e32 v15, v19, v16, vcc
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_maximum_v16f16:
@@ -2455,6 +2455,7 @@ define <16 x half> @v_maximum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[10:11], v18, v17
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v17, 16, v9
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v18, 16, v1
+; GFX8-NEXT:    v_mov_b32_e32 v19, 0x7e00
 ; GFX8-NEXT:    v_max_f16_e32 v24, v18, v17
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[12:13], v18, v17
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v17, 16, v8
@@ -2469,28 +2470,26 @@ define <16 x half> @v_maximum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[20:21], v4, v12
 ; GFX8-NEXT:    v_max_f16_e32 v4, v3, v11
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[22:23], v3, v11
+; GFX8-NEXT:    v_max_f16_e32 v3, v2, v10
 ; GFX8-NEXT:    v_max_f16_e32 v11, v7, v15
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[24:25], v7, v15
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v12, 16, v15
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GFX8-NEXT:    v_mov_b32_e32 v19, 0x7e00
+; GFX8-NEXT:    v_cndmask_b32_e32 v14, v19, v16, vcc
+; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v2, v10
 ; GFX8-NEXT:    v_max_f16_e32 v13, v7, v12
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[26:27], v7, v12
-; GFX8-NEXT:    v_max_f16_e32 v3, v2, v10
-; GFX8-NEXT:    v_cndmask_b32_e64 v12, v19, v13, s[26:27]
-; GFX8-NEXT:    v_cndmask_b32_e32 v13, v19, v16, vcc
-; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v2, v10
-; GFX8-NEXT:    v_max_f16_e32 v14, v1, v9
+; GFX8-NEXT:    v_max_f16_e32 v7, v1, v9
 ; GFX8-NEXT:    v_cndmask_b32_e32 v2, v19, v3, vcc
 ; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v1, v9
-; GFX8-NEXT:    v_max_f16_e32 v7, v0, v8
+; GFX8-NEXT:    v_max_f16_e32 v12, v0, v8
 ; GFX8-NEXT:    v_cndmask_b32_e64 v18, v19, v22, s[8:9]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v22, v19, v25, s[14:15]
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v19, v14, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v1, v19, v7, vcc
 ; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v0, v8
 ; GFX8-NEXT:    v_cndmask_b32_e64 v16, v19, v21, s[6:7]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v21, v19, v24, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e32 v0, v19, v7, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v0, v19, v12, vcc
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v3, 16, v22
 ; GFX8-NEXT:    v_cndmask_b32_e64 v15, v19, v20, s[4:5]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v20, v19, v23, s[10:11]
@@ -2504,14 +2503,15 @@ define <16 x half> @v_maximum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX8-NEXT:    v_cndmask_b32_e64 v5, v19, v5, s[20:21]
 ; GFX8-NEXT:    v_or_b32_sdwa v3, v4, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v4, 16, v16
+; GFX8-NEXT:    v_cndmask_b32_e64 v13, v19, v13, s[26:27]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v6, v19, v6, s[18:19]
 ; GFX8-NEXT:    v_or_b32_sdwa v4, v5, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v5, 16, v15
 ; GFX8-NEXT:    v_cndmask_b32_e64 v11, v19, v11, s[24:25]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v17, v19, v17, s[16:17]
 ; GFX8-NEXT:    v_or_b32_sdwa v5, v6, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_lshlrev_b32_e32 v6, 16, v13
-; GFX8-NEXT:    v_lshlrev_b32_e32 v7, 16, v12
+; GFX8-NEXT:    v_lshlrev_b32_e32 v6, 16, v14
+; GFX8-NEXT:    v_lshlrev_b32_e32 v7, 16, v13
 ; GFX8-NEXT:    v_or_b32_sdwa v6, v17, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v7, v11, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll b/llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll
index 8b1ba393c8de8..826bf427503ab 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll
@@ -1684,7 +1684,7 @@ define <8 x float> @v_maximum_v8f32(<8 x float> %src0, <8 x float> %src1) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-NEXT:    v_cndmask_b32_e32 v0, 0x7fc00000, v16, vcc_lo
 ; GFX11-NEXT:    v_cmp_o_f32_e32 vcc_lo, v1, v9
-; GFX11-NEXT:    v_dual_max_f32 v9, v3, v11 :: v_dual_max_f32 v8, v2, v10
+; GFX11-NEXT:    v_dual_max_f32 v8, v2, v10 :: v_dual_max_f32 v9, v3, v11
 ; GFX11-NEXT:    v_cndmask_b32_e32 v1, 0x7fc00000, v17, vcc_lo
 ; GFX11-NEXT:    v_cmp_o_f32_e32 vcc_lo, v2, v10
 ; GFX11-NEXT:    v_max_f32_e32 v10, v7, v15
@@ -1727,169 +1727,169 @@ define <16 x float> @v_maximum_v16f32(<16 x float> %src0, <16 x float> %src1) {
 ; GFX7-LABEL: v_maximum_v16f32:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX7-NEXT:    v_max_f32_e32 v0, v0, v16
+; GFX7-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v17
 ; GFX7-NEXT:    v_max_f32_e32 v1, v1, v17
-; GFX7-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v18
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v18
 ; GFX7-NEXT:    v_max_f32_e32 v2, v2, v18
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v19
+; GFX7-NEXT:    v_mov_b32_e32 v17, 0x7fc00000
+; GFX7-NEXT:    v_max_f32_e32 v18, v13, v29
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v29
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v19
 ; GFX7-NEXT:    v_max_f32_e32 v3, v3, v19
-; GFX7-NEXT:    v_mov_b32_e32 v18, 0x7fc00000
-; GFX7-NEXT:    v_max_f32_e32 v19, v0, v16
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[28:29], v0, v16
-; GFX7-NEXT:    v_max_f32_e32 v16, v14, v30
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v20
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v20
 ; GFX7-NEXT:    v_max_f32_e32 v4, v4, v20
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v21
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v21
 ; GFX7-NEXT:    v_max_f32_e32 v5, v5, v21
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v22
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v22
 ; GFX7-NEXT:    v_max_f32_e32 v6, v6, v22
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v23
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v23
 ; GFX7-NEXT:    v_max_f32_e32 v7, v7, v23
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v24
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v24
 ; GFX7-NEXT:    v_max_f32_e32 v8, v8, v24
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v25
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v25
 ; GFX7-NEXT:    v_max_f32_e32 v9, v9, v25
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v26
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v26
 ; GFX7-NEXT:    v_max_f32_e32 v10, v10, v26
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v27
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v27
 ; GFX7-NEXT:    v_max_f32_e32 v11, v11, v27
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v28
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[26:27], v12, v28
 ; GFX7-NEXT:    v_max_f32_e32 v12, v12, v28
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[26:27], v13, v29
-; GFX7-NEXT:    v_max_f32_e32 v13, v13, v29
-; GFX7-NEXT:    v_cndmask_b32_e32 v1, v18, v1, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v14, v18, v16, s[40:41]
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v18, v19, s[28:29]
-; GFX7-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v3, v18, v3, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v4, v18, v4, s[8:9]
-; GFX7-NEXT:    v_cndmask_b32_e64 v5, v18, v5, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v6, v18, v6, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v7, v18, v7, s[14:15]
-; GFX7-NEXT:    v_cndmask_b32_e64 v8, v18, v8, s[16:17]
-; GFX7-NEXT:    v_cndmask_b32_e64 v9, v18, v9, s[18:19]
-; GFX7-NEXT:    v_cndmask_b32_e64 v10, v18, v10, s[20:21]
-; GFX7-NEXT:    v_cndmask_b32_e64 v11, v18, v11, s[22:23]
-; GFX7-NEXT:    v_cndmask_b32_e64 v12, v18, v12, s[24:25]
-; GFX7-NEXT:    v_cndmask_b32_e64 v13, v18, v13, s[26:27]
+; GFX7-NEXT:    v_max_f32_e32 v19, v14, v30
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
+; GFX7-NEXT:    v_cndmask_b32_e32 v0, v17, v0, vcc
+; GFX7-NEXT:    v_cndmask_b32_e64 v13, v17, v18, s[28:29]
+; GFX7-NEXT:    v_cndmask_b32_e64 v1, v17, v1, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v2, v17, v2, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v3, v17, v3, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v4, v17, v4, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v5, v17, v5, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, v17, v6, s[14:15]
+; GFX7-NEXT:    v_cndmask_b32_e64 v7, v17, v7, s[16:17]
+; GFX7-NEXT:    v_cndmask_b32_e64 v8, v17, v8, s[18:19]
+; GFX7-NEXT:    v_cndmask_b32_e64 v9, v17, v9, s[20:21]
+; GFX7-NEXT:    v_cndmask_b32_e64 v10, v17, v10, s[22:23]
+; GFX7-NEXT:    v_cndmask_b32_e64 v11, v17, v11, s[24:25]
+; GFX7-NEXT:    v_cndmask_b32_e64 v12, v17, v12, s[26:27]
+; GFX7-NEXT:    v_cndmask_b32_e64 v14, v17, v19, s[40:41]
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_max_f32_e32 v16, v15, v17
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
-; GFX7-NEXT:    v_cndmask_b32_e32 v15, v18, v16, vcc
+; GFX7-NEXT:    v_max_f32_e32 v18, v15, v16
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v15, v16
+; GFX7-NEXT:    v_cndmask_b32_e32 v15, v17, v18, vcc
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_maximum_v16f32:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
+; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX8-NEXT:    v_max_f32_e32 v0, v0, v16
+; GFX8-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v17
 ; GFX8-NEXT:    v_max_f32_e32 v1, v1, v17
-; GFX8-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v18
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v18
 ; GFX8-NEXT:    v_max_f32_e32 v2, v2, v18
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v19
+; GFX8-NEXT:    v_mov_b32_e32 v17, 0x7fc00000
+; GFX8-NEXT:    v_max_f32_e32 v18, v13, v29
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v29
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v19
 ; GFX8-NEXT:    v_max_f32_e32 v3, v3, v19
-; GFX8-NEXT:    v_mov_b32_e32 v18, 0x7fc00000
-; GFX8-NEXT:    v_max_f32_e32 v19, v0, v16
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[28:29], v0, v16
-; GFX8-NEXT:    v_max_f32_e32 v16, v14, v30
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v20
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v20
 ; GFX8-NEXT:    v_max_f32_e32 v4, v4, v20
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v21
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v21
 ; GFX8-NEXT:    v_max_f32_e32 v5, v5, v21
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v22
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v22
 ; GFX8-NEXT:    v_max_f32_e32 v6, v6, v22
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v23
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v23
 ; GFX8-NEXT:    v_max_f32_e32 v7, v7, v23
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v24
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v24
 ; GFX8-NEXT:    v_max_f32_e32 v8, v8, v24
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v25
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v25
 ; GFX8-NEXT:    v_max_f32_e32 v9, v9, v25
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v26
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v26
 ; GFX8-NEXT:    v_max_f32_e32 v10, v10, v26
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v27
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v27
 ; GFX8-NEXT:    v_max_f32_e32 v11, v11, v27
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v28
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[26:27], v12, v28
 ; GFX8-NEXT:    v_max_f32_e32 v12, v12, v28
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[26:27], v13, v29
-; GFX8-NEXT:    v_max_f32_e32 v13, v13, v29
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v18, v1, vcc
-; GFX8-NEXT:    v_cndmask_b32_e64 v14, v18, v16, s[40:41]
-; GFX8-NEXT:    v_cndmask_b32_e64 v0, v18, v19, s[28:29]
-; GFX8-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v3, v18, v3, s[6:7]
-; GFX8-NEXT:    v_cndmask_b32_e64 v4, v18, v4, s[8:9]
-; GFX8-NEXT:    v_cndmask_b32_e64 v5, v18, v5, s[10:11]
-; GFX8-NEXT:    v_cndmask_b32_e64 v6, v18, v6, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e64 v7, v18, v7, s[14:15]
-; GFX8-NEXT:    v_cndmask_b32_e64 v8, v18, v8, s[16:17]
-; GFX8-NEXT:    v_cndmask_b32_e64 v9, v18, v9, s[18:19]
-; GFX8-NEXT:    v_cndmask_b32_e64 v10, v18, v10, s[20:21]
-; GFX8-NEXT:    v_cndmask_b32_e64 v11, v18, v11, s[22:23]
-; GFX8-NEXT:    v_cndmask_b32_e64 v12, v18, v12, s[24:25]
-; GFX8-NEXT:    v_cndmask_b32_e64 v13, v18, v13, s[26:27]
+; GFX8-NEXT:    v_max_f32_e32 v19, v14, v30
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
+; GFX8-NEXT:    v_cndmask_b32_e32 v0, v17, v0, vcc
+; GFX8-NEXT:    v_cndmask_b32_e64 v13, v17, v18, s[28:29]
+; GFX8-NEXT:    v_cndmask_b32_e64 v1, v17, v1, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v2, v17, v2, s[6:7]
+; GFX8-NEXT:    v_cndmask_b32_e64 v3, v17, v3, s[8:9]
+; GFX8-NEXT:    v_cndmask_b32_e64 v4, v17, v4, s[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v5, v17, v5, s[12:13]
+; GFX8-NEXT:    v_cndmask_b32_e64 v6, v17, v6, s[14:15]
+; GFX8-NEXT:    v_cndmask_b32_e64 v7, v17, v7, s[16:17]
+; GFX8-NEXT:    v_cndmask_b32_e64 v8, v17, v8, s[18:19]
+; GFX8-NEXT:    v_cndmask_b32_e64 v9, v17, v9, s[20:21]
+; GFX8-NEXT:    v_cndmask_b32_e64 v10, v17, v10, s[22:23]
+; GFX8-NEXT:    v_cndmask_b32_e64 v11, v17, v11, s[24:25]
+; GFX8-NEXT:    v_cndmask_b32_e64 v12, v17, v12, s[26:27]
+; GFX8-NEXT:    v_cndmask_b32_e64 v14, v17, v19, s[40:41]
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_max_f32_e32 v16, v15, v17
-; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
-; GFX8-NEXT:    v_cndmask_b32_e32 v15, v18, v16, vcc
+; GFX8-NEXT:    v_max_f32_e32 v18, v15, v16
+; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v15, v16
+; GFX8-NEXT:    v_cndmask_b32_e32 v15, v17, v18, vcc
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX900-LABEL: v_maximum_v16f32:
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
+; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX900-NEXT:    v_max_f32_e32 v0, v0, v16
+; GFX900-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v17
 ; GFX900-NEXT:    v_max_f32_e32 v1, v1, v17
-; GFX900-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v18
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v18
 ; GFX900-NEXT:    v_max_f32_e32 v2, v2, v18
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v19
+; GFX900-NEXT:    v_mov_b32_e32 v17, 0x7fc00000
+; GFX900-NEXT:    v_max_f32_e32 v18, v13, v29
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v29
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v19
 ; GFX900-NEXT:    v_max_f32_e32 v3, v3, v19
-; GFX900-NEXT:    v_mov_b32_e32 v18, 0x7fc00000
-; GFX900-NEXT:    v_max_f32_e32 v19, v0, v16
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[28:29], v0, v16
-; GFX900-NEXT:    v_max_f32_e32 v16, v14, v30
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v20
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v20
 ; GFX900-NEXT:    v_max_f32_e32 v4, v4, v20
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v21
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v21
 ; GFX900-NEXT:    v_max_f32_e32 v5, v5, v21
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v22
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v22
 ; GFX900-NEXT:    v_max_f32_e32 v6, v6, v22
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v23
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v23
 ; GFX900-NEXT:    v_max_f32_e32 v7, v7, v23
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v24
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v24
 ; GFX900-NEXT:    v_max_f32_e32 v8, v8, v24
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v25
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v25
 ; GFX900-NEXT:    v_max_f32_e32 v9, v9, v25
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v26
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v26
 ; GFX900-NEXT:    v_max_f32_e32 v10, v10, v26
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v27
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v27
 ; GFX900-NEXT:    v_max_f32_e32 v11, v11, v27
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v28
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[26:27], v12, v28
 ; GFX900-NEXT:    v_max_f32_e32 v12, v12, v28
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[26:27], v13, v29
-; GFX900-NEXT:    v_max_f32_e32 v13, v13, v29
-; GFX900-NEXT:    v_cndmask_b32_e32 v1, v18, v1, vcc
-; GFX900-NEXT:    v_cndmask_b32_e64 v14, v18, v16, s[40:41]
-; GFX900-NEXT:    v_cndmask_b32_e64 v0, v18, v19, s[28:29]
-; GFX900-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v3, v18, v3, s[6:7]
-; GFX900-NEXT:    v_cndmask_b32_e64 v4, v18, v4, s[8:9]
-; GFX900-NEXT:    v_cndmask_b32_e64 v5, v18, v5, s[10:11]
-; GFX900-NEXT:    v_cndmask_b32_e64 v6, v18, v6, s[12:13]
-; GFX900-NEXT:    v_cndmask_b32_e64 v7, v18, v7, s[14:15]
-; GFX900-NEXT:    v_cndmask_b32_e64 v8, v18, v8, s[16:17]
-; GFX900-NEXT:    v_cndmask_b32_e64 v9, v18, v9, s[18:19]
-; GFX900-NEXT:    v_cndmask_b32_e64 v10, v18, v10, s[20:21]
-; GFX900-NEXT:    v_cndmask_b32_e64 v11, v18, v11, s[22:23]
-; GFX900-NEXT:    v_cndmask_b32_e64 v12, v18, v12, s[24:25]
-; GFX900-NEXT:    v_cndmask_b32_e64 v13, v18, v13, s[26:27]
+; GFX900-NEXT:    v_max_f32_e32 v19, v14, v30
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
+; GFX900-NEXT:    v_cndmask_b32_e32 v0, v17, v0, vcc
+; GFX900-NEXT:    v_cndmask_b32_e64 v13, v17, v18, s[28:29]
+; GFX900-NEXT:    v_cndmask_b32_e64 v1, v17, v1, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v2, v17, v2, s[6:7]
+; GFX900-NEXT:    v_cndmask_b32_e64 v3, v17, v3, s[8:9]
+; GFX900-NEXT:    v_cndmask_b32_e64 v4, v17, v4, s[10:11]
+; GFX900-NEXT:    v_cndmask_b32_e64 v5, v17, v5, s[12:13]
+; GFX900-NEXT:    v_cndmask_b32_e64 v6, v17, v6, s[14:15]
+; GFX900-NEXT:    v_cndmask_b32_e64 v7, v17, v7, s[16:17]
+; GFX900-NEXT:    v_cndmask_b32_e64 v8, v17, v8, s[18:19]
+; GFX900-NEXT:    v_cndmask_b32_e64 v9, v17, v9, s[20:21]
+; GFX900-NEXT:    v_cndmask_b32_e64 v10, v17, v10, s[22:23]
+; GFX900-NEXT:    v_cndmask_b32_e64 v11, v17, v11, s[24:25]
+; GFX900-NEXT:    v_cndmask_b32_e64 v12, v17, v12, s[26:27]
+; GFX900-NEXT:    v_cndmask_b32_e64 v14, v17, v19, s[40:41]
 ; GFX900-NEXT:    s_waitcnt vmcnt(0)
-; GFX900-NEXT:    v_max_f32_e32 v16, v15, v17
-; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
-; GFX900-NEXT:    v_cndmask_b32_e32 v15, v18, v16, vcc
+; GFX900-NEXT:    v_max_f32_e32 v18, v15, v16
+; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v15, v16
+; GFX900-NEXT:    v_cndmask_b32_e32 v15, v17, v18, vcc
 ; GFX900-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX950-LABEL: v_maximum_v16f32:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll b/llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll
index 3344c73f9eb6f..f971080e02c5b 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll
@@ -820,18 +820,18 @@ define void @s_maximum_v2f64(<2 x double> inreg %src0, <2 x double> inreg %src1)
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    v_mov_b32_e32 v0, s22
-; GFX7-NEXT:    v_mov_b32_e32 v4, s20
 ; GFX7-NEXT:    v_mov_b32_e32 v1, s23
-; GFX7-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX7-NEXT:    v_max_f64 v[2:3], s[18:19], v[0:1]
 ; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, s[18:19], v[0:1]
-; GFX7-NEXT:    v_max_f64 v[0:1], s[16:17], v[4:5]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[4:5]
+; GFX7-NEXT:    v_mov_b32_e32 v0, s20
+; GFX7-NEXT:    v_mov_b32_e32 v1, s21
+; GFX7-NEXT:    v_max_f64 v[4:5], s[16:17], v[0:1]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[0:1]
 ; GFX7-NEXT:    v_mov_b32_e32 v6, 0x7ff80000
 ; GFX7-NEXT:    v_cndmask_b32_e32 v3, v3, v6, vcc
 ; GFX7-NEXT:    v_cndmask_b32_e64 v2, v2, 0, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v0, 0, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v1, v5, v6, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v0, v4, 0, s[4:5]
 ; GFX7-NEXT:    ;;#ASMSTART
 ; GFX7-NEXT:    ; use v[0:3]
 ; GFX7-NEXT:    ;;#ASMEND
@@ -841,18 +841,18 @@ define void @s_maximum_v2f64(<2 x double> inreg %src0, <2 x double> inreg %src1)
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    v_mov_b32_e32 v0, s22
-; GFX8-NEXT:    v_mov_b32_e32 v4, s20
 ; GFX8-NEXT:    v_mov_b32_e32 v1, s23
-; GFX8-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX8-NEXT:    v_max_f64 v[2:3], s[18:19], v[0:1]
 ; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, s[18:19], v[0:1]
-; GFX8-NEXT:    v_max_f64 v[0:1], s[16:17], v[4:5]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[4:5]
+; GFX8-NEXT:    v_mov_b32_e32 v0, s20
+; GFX8-NEXT:    v_mov_b32_e32 v1, s21
+; GFX8-NEXT:    v_max_f64 v[4:5], s[16:17], v[0:1]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[0:1]
 ; GFX8-NEXT:    v_mov_b32_e32 v6, 0x7ff80000
 ; GFX8-NEXT:    v_cndmask_b32_e32 v3, v3, v6, vcc
 ; GFX8-NEXT:    v_cndmask_b32_e64 v2, v2, 0, vcc
-; GFX8-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v0, v0, 0, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v1, v5, v6, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v0, v4, 0, s[4:5]
 ; GFX8-NEXT:    ;;#ASMSTART
 ; GFX8-NEXT:    ; use v[0:3]
 ; GFX8-NEXT:    ;;#ASMEND
@@ -862,18 +862,18 @@ define void @s_maximum_v2f64(<2 x double> inreg %src0, <2 x double> inreg %src1)
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX900-NEXT:    v_mov_b32_e32 v0, s22
-; GFX900-NEXT:    v_mov_b32_e32 v4, s20
 ; GFX900-NEXT:    v_mov_b32_e32 v1, s23
-; GFX900-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX900-NEXT:    v_max_f64 v[2:3], s[18:19], v[0:1]
 ; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, s[18:19], v[0:1]
-; GFX900-NEXT:    v_max_f64 v[0:1], s[16:17], v[4:5]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[4:5]
+; GFX900-NEXT:    v_mov_b32_e32 v0, s20
+; GFX900-NEXT:    v_mov_b32_e32 v1, s21
+; GFX900-NEXT:    v_max_f64 v[4:5], s[16:17], v[0:1]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[0:1]
 ; GFX900-NEXT:    v_mov_b32_e32 v6, 0x7ff80000
 ; GFX900-NEXT:    v_cndmask_b32_e32 v3, v3, v6, vcc
 ; GFX900-NEXT:    v_cndmask_b32_e64 v2, v2, 0, vcc
-; GFX900-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v0, v0, 0, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v1, v5, v6, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v0, v4, 0, s[4:5]
 ; GFX900-NEXT:    ;;#ASMSTART
 ; GFX900-NEXT:    ; use v[0:3]
 ; GFX900-NEXT:    ;;#ASMEND
@@ -1743,120 +1743,120 @@ define <8 x double> @v_maximum_v8f64(<8 x double> %src0, <8 x double> %src1) {
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX7-NEXT:    v_max_f64 v[32:33], v[2:3], v[18:19]
-; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[18:19]
-; GFX7-NEXT:    v_max_f64 v[18:19], v[4:5], v[20:21]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], v[4:5], v[20:21]
-; GFX7-NEXT:    v_max_f64 v[2:3], v[0:1], v[16:17]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[8:9], v[0:1], v[16:17]
+; GFX7-NEXT:    v_max_f64 v[32:33], v[0:1], v[16:17]
+; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[16:17]
+; GFX7-NEXT:    v_max_f64 v[16:17], v[2:3], v[18:19]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], v[2:3], v[18:19]
 ; GFX7-NEXT:    v_mov_b32_e32 v34, 0x7ff80000
+; GFX7-NEXT:    v_max_f64 v[18:19], v[4:5], v[20:21]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[6:7], v[4:5], v[20:21]
 ; GFX7-NEXT:    v_max_f64 v[20:21], v[6:7], v[22:23]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[6:7], v[6:7], v[22:23]
-; GFX7-NEXT:    v_max_f64 v[16:17], v[8:9], v[24:25]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[8:9], v[6:7], v[22:23]
+; GFX7-NEXT:    v_max_f64 v[22:23], v[8:9], v[24:25]
 ; GFX7-NEXT:    v_cmp_u_f64_e64 s[10:11], v[8:9], v[24:25]
-; GFX7-NEXT:    v_max_f64 v[22:23], v[10:11], v[26:27]
+; GFX7-NEXT:    v_max_f64 v[24:25], v[10:11], v[26:27]
 ; GFX7-NEXT:    v_cmp_u_f64_e64 s[12:13], v[10:11], v[26:27]
-; GFX7-NEXT:    v_max_f64 v[24:25], v[12:13], v[28:29]
+; GFX7-NEXT:    v_max_f64 v[26:27], v[12:13], v[28:29]
 ; GFX7-NEXT:    v_cmp_u_f64_e64 s[14:15], v[12:13], v[28:29]
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v2, 0, s[8:9]
-; GFX7-NEXT:    v_cndmask_b32_e64 v1, v3, v34, s[8:9]
-; GFX7-NEXT:    v_cndmask_b32_e64 v2, v32, 0, vcc
-; GFX7-NEXT:    v_cndmask_b32_e32 v3, v33, v34, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v8, v16, 0, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v9, v17, v34, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v10, v22, 0, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v11, v23, v34, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v12, v24, 0, s[14:15]
-; GFX7-NEXT:    v_cndmask_b32_e64 v13, v25, v34, s[14:15]
+; GFX7-NEXT:    v_cndmask_b32_e64 v0, v32, 0, vcc
+; GFX7-NEXT:    v_cndmask_b32_e32 v1, v33, v34, vcc
+; GFX7-NEXT:    v_cndmask_b32_e64 v2, v16, 0, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v3, v17, v34, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v8, v22, 0, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v9, v23, v34, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v10, v24, 0, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v11, v25, v34, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v12, v26, 0, s[14:15]
+; GFX7-NEXT:    v_cndmask_b32_e64 v13, v27, v34, s[14:15]
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_max_f64 v[18:19], v[14:15], v[30:31]
+; GFX7-NEXT:    v_max_f64 v[16:17], v[14:15], v[30:31]
 ; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[30:31]
-; GFX7-NEXT:    v_cndmask_b32_e64 v14, v18, 0, vcc
-; GFX7-NEXT:    v_cndmask_b32_e32 v15, v19, v34, vcc
+; GFX7-NEXT:    v_cndmask_b32_e64 v14, v16, 0, vcc
+; GFX7-NEXT:    v_cndmask_b32_e32 v15, v17, v34, vcc
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_maximum_v8f64:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX8-NEXT:    v_max_f64 v[32:33], v[2:3], v[18:19]
-; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[18:19]
-; GFX8-NEXT:    v_max_f64 v[18:19], v[4:5], v[20:21]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], v[4:5], v[20:21]
-; GFX8-NEXT:    v_max_f64 v[2:3], v[0:1], v[16:17]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[8:9], v[0:1], v[16:17]
+; GFX8-NEXT:    v_max_f64 v[32:33], v[0:1], v[16:17]
+; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[16:17]
+; GFX8-NEXT:    v_max_f64 v[16:17], v[2:3], v[18:19]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], v[2:3], v[18:19]
 ; GFX8-NEXT:    v_mov_b32_e32 v34, 0x7ff80000
+; GFX8-NEXT:    v_max_f64 v[18:19], v[4:5], v[20:21]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[6:7], v[4:5], v[20:21]
 ; GFX8-NEXT:    v_max_f64 v[20:21], v[6:7], v[22:23]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[6:7], v[6:7], v[22:23]
-; GFX8-NEXT:    v_max_f64 v[16:17], v[8:9], v[24:25]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[8:9], v[6:7], v[22:23]
+; GFX8-NEXT:    v_max_f64 v[22:23], v[8:9], v[24:25]
 ; GFX8-NEXT:    v_cmp_u_f64_e64 s[10:11], v[8:9], v[24:25]
-; GFX8-NEXT:    v_max_f64 v[22:23], v[10:11], v[26:27]
+; GFX8-NEXT:    v_max_f64 v[24:25], v[10:11], v[26:27]
 ; GFX8-NEXT:    v_cmp_u_f64_e64 s[12:13], v[10:11], v[26:27]
-; GFX8-NEXT:    v_max_f64 v[24:25], v[12:13], v[28:29]
+; GFX8-NEXT:    v_max_f64 v[26:27], v[12:13], v[28:29]
 ; GFX8-NEXT:    v_cmp_u_f64_e64 s[14:15], v[12:13], v[28:29]
-; GFX8-NEXT:    v_cndmask_b32_e64 v0, v2, 0, s[8:9]
-; GFX8-NEXT:    v_cndmask_b32_e64 v1, v3, v34, s[8:9]
-; GFX8-NEXT:    v_cndmask_b32_e64 v2, v32, 0, vcc
-; GFX8-NEXT:    v_cndmask_b32_e32 v3, v33, v34, vcc
-; GFX8-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[6:7]
-; GFX8-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[6:7]
-; GFX8-NEXT:    v_cndmask_b32_e64 v8, v16, 0, s[10:11]
-; GFX8-NEXT:    v_cndmask_b32_e64 v9, v17, v34, s[10:11]
-; GFX8-NEXT:    v_cndmask_b32_e64 v10, v22, 0, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e64 v11, v23, v34, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e64 v12, v24, 0, s[14:15]
-; GFX8-NEXT:    v_cndmask_b32_e64 v13, v25, v34, s[14:15]
+; GFX8-NEXT:    v_cndmask_b32_e64 v0, v32, 0, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v1, v33, v34, vcc
+; GFX8-NEXT:    v_cndmask_b32_e64 v2, v16, 0, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v3, v17, v34, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[6:7]
+; GFX8-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[6:7]
+; GFX8-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[8:9]
+; GFX8-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[8:9]
+; GFX8-NEXT:    v_cndmask_b32_e64 v8, v22, 0, s[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v9, v23, v34, s[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v10, v24, 0, s[12:13]
+; GFX8-NEXT:    v_cndmask_b32_e64 v11, v25, v34, s[12:13]
+; GFX8-NEXT:    v_cndmask_b32_e64 v12, v26, 0, s[14:15]
+; GFX8-NEXT:    v_cndmask_b32_e64 v13, v27, v34, s[14:15]
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_max_f64 v[18:19], v[14:15], v[30:31]
+; GFX8-NEXT:    v_max_f64 v[16:17], v[14:15], v[30:31]
 ; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[30:31]
-; GFX8-NEXT:    v_cndmask_b32_e64 v14, v18, 0, vcc
-; GFX8-NEXT:    v_cndmask_b32_e32 v15, v19, v34, vcc
+; GFX8-NEXT:    v_cndmask_b32_e64 v14, v16, 0, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v15, v17, v34, vcc
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX900-LABEL: v_maximum_v8f64:
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX900-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX900-NEXT:    v_max_f64 v[32:33], v[2:3], v[18:19]
-; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[18:19]
-; GFX900-NEXT:    v_max_f64 v[18:19], v[4:5], v[20:21]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], v[4:5], v[20:21]
-; GFX900-NEXT:    v_max_f64 v[2:3], v[0:1], v[16:17]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[8:9], v[0:1], v[16:17]
+; GFX900-NEXT:    v_max_f64 v[32:33], v[0:1], v[16:17]
+; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[16:17]
+; GFX900-NEXT:    v_max_f64 v[16:17], v[2:3], v[18:19]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], v[2:3], v[18:19]
 ; GFX900-NEXT:    v_mov_b32_e32 v34, 0x7ff80000
+; GFX900-NEXT:    v_max_f64 v[18:19], v[4:5], v[20:21]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[6:7], v[4:5], v[20:21]
 ; GFX900-NEXT:    v_max_f64 v[20:21], v[6:7], v[22:23]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[6:7], v[6:7], v[22:23]
-; GFX900-NEXT:    v_max_f64 v[16:17], v[8:9], v[24:25]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[8:9], v[6:7], v[22:23]
+; GFX900-NEXT:    v_max_f64 v[22:23], v[8:9], v[24:25]
 ; GFX900-NEXT:    v_cmp_u_f64_e64 s[10:11], v[8:9], v[24:25]
-; GFX900-NEXT:    v_max_f64 v[22:23], v[10:11], v[26:27]
+; GFX900-NEXT:    v_max_f64 v[24:25], v[10:11], v[26:27]
 ; GFX900-NEXT:    v_cmp_u_f64_e64 s[12:13], v[10:11], v[26:27]
-; GFX900-NEXT:    v_max_f64 v[24:25], v[12:13], v[28:29]
+; GFX900-NEXT:    v_max_f64 v[26:27], v[12:13], v[28:29]
 ; GFX900-NEXT:    v_cmp_u_f64_e64 s[14:15], v[12:13], v[28:29]
-; GFX900-NEXT:    v_cndmask_b32_e64 v0, v2, 0, s[8:9]
-; GFX900-NEXT:    v_cndmask_b32_e64 v1, v3, v34, s[8:9]
-; GFX900-NEXT:    v_cndmask_b32_e64 v2, v32, 0, vcc
-; GFX900-NEXT:    v_cndmask_b32_e32 v3, v33, v34, vcc
-; GFX900-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[6:7]
-; GFX900-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[6:7]
-; GFX900-NEXT:    v_cndmask_b32_e64 v8, v16, 0, s[10:11]
-; GFX900-NEXT:    v_cndmask_b32_e64 v9, v17, v34, s[10:11]
-; GFX900-NEXT:    v_cndmask_b32_e64 v10, v22, 0, s[12:13]
-; GFX900-NEXT:    v_cndmask_b32_e64 v11, v23, v34, s[12:13]
-; GFX900-NEXT:    v_cndmask_b32_e64 v12, v24, 0, s[14:15]
-; GFX900-NEXT:    v_cndmask_b32_e64 v13, v25, v34, s[14:15]
+; GFX900-NEXT:    v_cndmask_b32_e64 v0, v32, 0, vcc
+; GFX900-NEXT:    v_cndmask_b32_e32 v1, v33, v34, vcc
+; GFX900-NEXT:    v_cndmask_b32_e64 v2, v16, 0, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v3, v17, v34, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[6:7]
+; GFX900-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[6:7]
+; GFX900-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[8:9]
+; GFX900-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[8:9]
+; GFX900-NEXT:    v_cndmask_b32_e64 v8, v22, 0, s[10:11]
+; GFX900-NEXT:    v_cndmask_b32_e64 v9, v23, v34, s[10:11]
+; GFX900-NEXT:    v_cndmask_b32_e64 v10, v24, 0, s[12:13]
+; GFX900-NEXT:    v_cndmask_b32_e64 v11, v25, v34, s[12:13]
+; GFX900-NEXT:    v_cndmask_b32_e64 v12, v26, 0, s[14:15]
+; GFX900-NEXT:    v_cndmask_b32_e64 v13, v27, v34, s[14:15]
 ; GFX900-NEXT:    s_waitcnt vmcnt(0)
-; GFX900-NEXT:    v_max_f64 v[18:19], v[14:15], v[30:31]
+; GFX900-NEXT:    v_max_f64 v[16:17], v[14:15], v[30:31]
 ; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[30:31]
-; GFX900-NEXT:    v_cndmask_b32_e64 v14, v18, 0, vcc
-; GFX900-NEXT:    v_cndmask_b32_e32 v15, v19, v34, vcc
+; GFX900-NEXT:    v_cndmask_b32_e64 v14, v16, 0, vcc
+; GFX900-NEXT:    v_cndmask_b32_e32 v15, v17, v34, vcc
 ; GFX900-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX950-LABEL: v_maximum_v8f64:
@@ -2365,24 +2365,24 @@ define <16 x double> @v_maximum_v16f64(<16 x double> %src0, <16 x double> %src1)
 ; GFX950-LABEL: v_maximum_v16f64:
 ; GFX950:       ; %bb.0:
 ; GFX950-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX950-NEXT:    v_accvgpr_write_b32 a1, v40 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a2, v41 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a3, v42 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a4, v43 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a5, v44 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a6, v45 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a7, v46 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a8, v47 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a9, v56 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a10, v57 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a0, v40 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a1, v41 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a2, v42 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a3, v43 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a4, v44 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a5, v45 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a6, v46 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a7, v47 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a8, v56 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a9, v57 ; Reload Reuse
+; GFX950-NEXT:    scratch_load_dword v33, off, s32 offset:8
+; GFX950-NEXT:    scratch_load_dword v32, off, s32 offset:4
 ; GFX950-NEXT:    scratch_load_dword v37, off, s32 offset:16
 ; GFX950-NEXT:    scratch_load_dword v36, off, s32 offset:12
 ; GFX950-NEXT:    scratch_load_dword v39, off, s32 offset:24
 ; GFX950-NEXT:    scratch_load_dword v38, off, s32 offset:20
-; GFX950-NEXT:    scratch_load_dword v49, off, s32 offset:32
-; GFX950-NEXT:    scratch_load_dword v48, off, s32 offset:28
-; GFX950-NEXT:    scratch_load_dword v57, off, s32 offset:8
-; GFX950-NEXT:    scratch_load_dword v56, off, s32 offset:4
+; GFX950-NEXT:    scratch_load_dword v57, off, s32 offset:32
+; GFX950-NEXT:    scratch_load_dword v56, off, s32 offset:28
 ; GFX950-NEXT:    scratch_load_dword v47, off, s32 offset:40
 ; GFX950-NEXT:    scratch_load_dword v46, off, s32 offset:36
 ; GFX950-NEXT:    scratch_load_dword v45, off, s32 offset:48
@@ -2397,148 +2397,149 @@ define <16 x double> @v_maximum_v16f64(<16 x double> %src0, <16 x double> %src1)
 ; GFX950-NEXT:    scratch_load_dword v52, off, s32 offset:76
 ; GFX950-NEXT:    scratch_load_dword v51, off, s32 offset:88
 ; GFX950-NEXT:    scratch_load_dword v50, off, s32 offset:84
-; GFX950-NEXT:    scratch_load_dword v35, off, s32 offset:96
-; GFX950-NEXT:    scratch_load_dword v34, off, s32 offset:92
+; GFX950-NEXT:    scratch_load_dword v49, off, s32 offset:96
+; GFX950-NEXT:    scratch_load_dword v48, off, s32 offset:92
 ; GFX950-NEXT:    scratch_load_dword v31, off, s32
-; GFX950-NEXT:    scratch_load_dword v33, off, s32 offset:104
-; GFX950-NEXT:    scratch_load_dword v32, off, s32 offset:100
-; GFX950-NEXT:    v_accvgpr_write_b32 a11, v58 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a12, v59 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a13, v60 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a14, v61 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a15, v62 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a16, v63 ; Reload Reuse
+; GFX950-NEXT:    scratch_load_dword v35, off, s32 offset:104
+; GFX950-NEXT:    scratch_load_dword v34, off, s32 offset:100
+; GFX950-NEXT:    v_accvgpr_write_b32 a10, v58 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a11, v59 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a12, v60 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a13, v61 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a14, v62 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a15, v63 ; Reload Reuse
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_max_f64 v[58:59], v[2:3], v[36:37]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[36:37]
-; GFX950-NEXT:    scratch_load_dword v37, off, s32 offset:112
-; GFX950-NEXT:    scratch_load_dword v36, off, s32 offset:108
+; GFX950-NEXT:    v_max_f64 v[58:59], v[0:1], v[32:33]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[32:33]
+; GFX950-NEXT:    scratch_load_dword v33, off, s32 offset:112
+; GFX950-NEXT:    scratch_load_dword v32, off, s32 offset:108
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_max_f64 v[60:61], v[4:5], v[38:39]
-; GFX950-NEXT:    v_cmp_u_f64_e64 s[0:1], v[4:5], v[38:39]
-; GFX950-NEXT:    scratch_load_dword v39, off, s32 offset:120
-; GFX950-NEXT:    scratch_load_dword v38, off, s32 offset:116
+; GFX950-NEXT:    v_max_f64 v[60:61], v[2:3], v[36:37]
+; GFX950-NEXT:    v_cmp_u_f64_e64 s[0:1], v[2:3], v[36:37]
+; GFX950-NEXT:    scratch_load_dword v37, off, s32 offset:120
+; GFX950-NEXT:    scratch_load_dword v36, off, s32 offset:116
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_max_f64 v[62:63], v[6:7], v[48:49]
-; GFX950-NEXT:    v_cmp_u_f64_e64 s[2:3], v[6:7], v[48:49]
-; GFX950-NEXT:    scratch_load_dword v49, off, s32 offset:128
-; GFX950-NEXT:    scratch_load_dword v48, off, s32 offset:124
+; GFX950-NEXT:    v_max_f64 v[62:63], v[4:5], v[38:39]
+; GFX950-NEXT:    v_cmp_u_f64_e64 s[2:3], v[4:5], v[38:39]
+; GFX950-NEXT:    scratch_load_dword v39, off, s32 offset:128
+; GFX950-NEXT:    scratch_load_dword v38, off, s32 offset:124
+; GFX950-NEXT:    v_mov_b32_e32 v2, 0x7ff80000
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_max_f64 v[2:3], v[0:1], v[56:57]
-; GFX950-NEXT:    v_cmp_u_f64_e64 s[4:5], v[0:1], v[56:57]
-; GFX950-NEXT:    v_mov_b32_e32 v0, 0x7ff80000
+; GFX950-NEXT:    v_max_f64 v[0:1], v[6:7], v[56:57]
+; GFX950-NEXT:    v_cmp_u_f64_e64 s[4:5], v[6:7], v[56:57]
 ; GFX950-NEXT:    s_waitcnt vmcnt(23)
 ; GFX950-NEXT:    v_max_f64 v[56:57], v[8:9], v[46:47]
-; GFX950-NEXT:    v_cndmask_b32_e64 v1, v2, 0, s[4:5]
-; GFX950-NEXT:    v_accvgpr_write_b32 a0, v1
-; GFX950-NEXT:    v_cndmask_b32_e64 v1, v3, v0, s[4:5]
-; GFX950-NEXT:    v_cndmask_b32_e64 v2, v58, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v3, v59, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e64 v58, v58, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v59, v59, v2, vcc
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[8:9], v[46:47]
-; GFX950-NEXT:    s_waitcnt vmcnt(21)
-; GFX950-NEXT:    v_max_f64 v[46:47], v[10:11], v[44:45]
-; GFX950-NEXT:    v_cndmask_b32_e64 v4, v60, 0, s[0:1]
+; GFX950-NEXT:    v_cndmask_b32_e64 v6, v0, 0, s[4:5]
+; GFX950-NEXT:    v_cndmask_b32_e64 v7, v1, v2, s[4:5]
 ; GFX950-NEXT:    v_cndmask_b32_e64 v8, v56, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v9, v57, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v9, v57, v2, vcc
+; GFX950-NEXT:    s_waitcnt vmcnt(21)
+; GFX950-NEXT:    v_max_f64 v[0:1], v[10:11], v[44:45]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[10:11], v[44:45]
+; GFX950-NEXT:    v_cndmask_b32_e64 v60, v60, 0, s[0:1]
+; GFX950-NEXT:    v_cndmask_b32_e64 v3, v61, v2, s[0:1]
+; GFX950-NEXT:    v_cndmask_b32_e64 v10, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v11, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(19)
-; GFX950-NEXT:    v_max_f64 v[44:45], v[12:13], v[42:43]
-; GFX950-NEXT:    v_cndmask_b32_e64 v5, v61, v0, s[0:1]
-; GFX950-NEXT:    v_cndmask_b32_e64 v10, v46, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v11, v47, v0, vcc
+; GFX950-NEXT:    v_max_f64 v[0:1], v[12:13], v[42:43]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[12:13], v[42:43]
+; GFX950-NEXT:    v_cndmask_b32_e64 v4, v62, 0, s[2:3]
+; GFX950-NEXT:    v_cndmask_b32_e64 v5, v63, v2, s[2:3]
+; GFX950-NEXT:    v_cndmask_b32_e64 v12, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v13, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(17)
-; GFX950-NEXT:    v_max_f64 v[42:43], v[14:15], v[40:41]
-; GFX950-NEXT:    v_cndmask_b32_e64 v6, v62, 0, s[2:3]
-; GFX950-NEXT:    v_cndmask_b32_e64 v12, v44, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v13, v45, v0, vcc
+; GFX950-NEXT:    v_max_f64 v[0:1], v[14:15], v[40:41]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[40:41]
+; GFX950-NEXT:    v_accvgpr_read_b32 v63, a15 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v62, a14 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v14, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v15, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(15)
-; GFX950-NEXT:    v_max_f64 v[40:41], v[16:17], v[54:55]
-; GFX950-NEXT:    v_cndmask_b32_e64 v7, v63, v0, s[2:3]
-; GFX950-NEXT:    v_cndmask_b32_e64 v14, v42, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v15, v43, v0, vcc
+; GFX950-NEXT:    v_max_f64 v[0:1], v[16:17], v[54:55]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[16:17], v[54:55]
+; GFX950-NEXT:    v_accvgpr_read_b32 v61, a13 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v57, a9 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v16, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v17, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(13)
-; GFX950-NEXT:    v_max_f64 v[54:55], v[18:19], v[52:53]
-; GFX950-NEXT:    v_accvgpr_read_b32 v63, a16 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v16, v40, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v17, v41, v0, vcc
+; GFX950-NEXT:    v_max_f64 v[0:1], v[18:19], v[52:53]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[18:19], v[52:53]
+; GFX950-NEXT:    v_accvgpr_read_b32 v56, a8 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v47, a7 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v18, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v19, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(11)
-; GFX950-NEXT:    v_max_f64 v[52:53], v[20:21], v[50:51]
-; GFX950-NEXT:    v_accvgpr_read_b32 v62, a15 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v18, v54, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v19, v55, v0, vcc
+; GFX950-NEXT:    v_max_f64 v[0:1], v[20:21], v[50:51]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[20:21], v[50:51]
+; GFX950-NEXT:    v_accvgpr_read_b32 v46, a6 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v45, a5 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v20, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v21, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(9)
-; GFX950-NEXT:    v_max_f64 v[50:51], v[22:23], v[34:35]
-; GFX950-NEXT:    v_accvgpr_read_b32 v61, a14 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v20, v52, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v21, v53, v0, vcc
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[22:23], v[34:35]
+; GFX950-NEXT:    v_max_f64 v[0:1], v[22:23], v[48:49]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[22:23], v[48:49]
+; GFX950-NEXT:    v_accvgpr_read_b32 v44, a4 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v43, a3 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v22, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v23, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(6)
-; GFX950-NEXT:    v_max_f64 v[34:35], v[24:25], v[32:33]
-; GFX950-NEXT:    v_accvgpr_read_b32 v60, a13 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v22, v50, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v23, v51, v0, vcc
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[24:25], v[32:33]
-; GFX950-NEXT:    v_accvgpr_read_b32 v59, a12 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v58, a11 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v24, v34, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v25, v35, v0, vcc
-; GFX950-NEXT:    v_accvgpr_read_b32 v57, a10 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v56, a9 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v47, a8 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v46, a7 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v45, a6 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v44, a5 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v43, a4 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v42, a3 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v41, a2 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v40, a1 ; Reload Reuse
+; GFX950-NEXT:    v_max_f64 v[0:1], v[24:25], v[34:35]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[24:25], v[34:35]
+; GFX950-NEXT:    v_accvgpr_read_b32 v42, a2 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v41, a1 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v24, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v25, v1, v2, vcc
+; GFX950-NEXT:    v_accvgpr_read_b32 v40, a0 ; Reload Reuse
 ; GFX950-NEXT:    s_waitcnt vmcnt(4)
-; GFX950-NEXT:    v_max_f64 v[32:33], v[26:27], v[36:37]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[26:27], v[36:37]
+; GFX950-NEXT:    v_max_f64 v[0:1], v[26:27], v[32:33]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[26:27], v[32:33]
 ; GFX950-NEXT:    s_nop 1
-; GFX950-NEXT:    v_cndmask_b32_e64 v26, v32, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v27, v33, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e64 v26, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v27, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(2)
-; GFX950-NEXT:    v_max_f64 v[32:33], v[28:29], v[38:39]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[28:29], v[38:39]
+; GFX950-NEXT:    v_max_f64 v[0:1], v[28:29], v[36:37]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[28:29], v[36:37]
 ; GFX950-NEXT:    s_nop 1
-; GFX950-NEXT:    v_cndmask_b32_e64 v28, v32, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v29, v33, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e64 v28, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v29, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(0)
-; GFX950-NEXT:    v_max_f64 v[32:33], v[30:31], v[48:49]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[30:31], v[48:49]
+; GFX950-NEXT:    v_max_f64 v[0:1], v[30:31], v[38:39]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[30:31], v[38:39]
 ; GFX950-NEXT:    s_nop 1
-; GFX950-NEXT:    v_cndmask_b32_e64 v30, v32, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v31, v33, v0, vcc
-; GFX950-NEXT:    v_accvgpr_read_b32 v0, a0
+; GFX950-NEXT:    v_cndmask_b32_e64 v30, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v31, v1, v2, vcc
+; GFX950-NEXT:    v_mov_b32_e32 v0, v58
+; GFX950-NEXT:    v_mov_b32_e32 v1, v59
+; GFX950-NEXT:    v_mov_b32_e32 v2, v60
+; GFX950-NEXT:    v_accvgpr_read_b32 v60, a12 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v59, a11 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v58, a10 ; Reload Reuse
 ; GFX950-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_maximum_v16f64:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    s_clause 0x19
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:16
-; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:12
-; GFX10-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:24
-; GFX10-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:20
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:32
-; GFX10-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:28
+; GFX10-NEXT:    s_clause 0x18
+; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
+; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
+; GFX10-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:16
+; GFX10-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:12
+; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:24
+; GFX10-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:20
 ; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:36
-; GFX10-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:68
-; GFX10-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:64
-; GFX10-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:60
-; GFX10-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:56
-; GFX10-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
-; GFX10-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:48
-; GFX10-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
+; GFX10-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:32
+; GFX10-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:28
+; GFX10-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:68
+; GFX10-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:64
+; GFX10-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
+; GFX10-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:56
+; GFX10-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
+; GFX10-NEXT:    buffer_load_dword v65, off, s[0:3], s32 offset:48
+; GFX10-NEXT:    buffer_load_dword v64, off, s[0:3], s32 offset:44
 ; GFX10-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:40
-; GFX10-NEXT:    buffer_load_dword v65, off, s[0:3], s32 offset:8
-; GFX10-NEXT:    buffer_load_dword v64, off, s[0:3], s32 offset:4
 ; GFX10-NEXT:    buffer_load_dword v66, off, s[0:3], s32 offset:100
 ; GFX10-NEXT:    buffer_load_dword v69, off, s[0:3], s32 offset:96
 ; GFX10-NEXT:    buffer_load_dword v68, off, s[0:3], s32 offset:92
@@ -2546,96 +2547,95 @@ define <16 x double> @v_maximum_v16f64(<16 x double> %src0, <16 x double> %src1)
 ; GFX10-NEXT:    buffer_load_dword v70, off, s[0:3], s32 offset:84
 ; GFX10-NEXT:    buffer_load_dword v81, off, s[0:3], s32 offset:80
 ; GFX10-NEXT:    buffer_load_dword v80, off, s[0:3], s32 offset:76
-; GFX10-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:72
+; GFX10-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:72
+; GFX10-NEXT:    s_waitcnt vmcnt(23)
+; GFX10-NEXT:    v_max_f64 v[82:83], v[0:1], v[31:32]
+; GFX10-NEXT:    v_cmp_u_f64_e32 vcc_lo, v[0:1], v[31:32]
+; GFX10-NEXT:    s_waitcnt vmcnt(21)
+; GFX10-NEXT:    v_max_f64 v[84:85], v[2:3], v[33:34]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s4, v[2:3], v[33:34]
+; GFX10-NEXT:    s_waitcnt vmcnt(19)
+; GFX10-NEXT:    v_max_f64 v[32:33], v[4:5], v[35:36]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s5, v[4:5], v[35:36]
+; GFX10-NEXT:    s_clause 0x7
+; GFX10-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:112
 ; GFX10-NEXT:    buffer_load_dword v67, off, s[0:3], s32 offset:104
-; GFX10-NEXT:    s_waitcnt vmcnt(24)
-; GFX10-NEXT:    v_max_f64 v[82:83], v[2:3], v[31:32]
-; GFX10-NEXT:    v_cmp_u_f64_e32 vcc_lo, v[2:3], v[31:32]
-; GFX10-NEXT:    s_waitcnt vmcnt(22)
-; GFX10-NEXT:    v_max_f64 v[84:85], v[4:5], v[33:34]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s4, v[4:5], v[33:34]
-; GFX10-NEXT:    s_clause 0x3
+; GFX10-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:108
 ; GFX10-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:120
 ; GFX10-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:116
-; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:112
-; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:108
-; GFX10-NEXT:    s_waitcnt vmcnt(24)
-; GFX10-NEXT:    v_max_f64 v[32:33], v[6:7], v[35:36]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s5, v[6:7], v[35:36]
-; GFX10-NEXT:    s_clause 0x2
 ; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX10-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
-; GFX10-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:124
-; GFX10-NEXT:    s_waitcnt vmcnt(23)
-; GFX10-NEXT:    v_cmp_u_f64_e64 s10, v[14:15], v[50:51]
+; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:128
+; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:124
+; GFX10-NEXT:    s_waitcnt vmcnt(24)
+; GFX10-NEXT:    v_max_f64 v[34:35], v[6:7], v[48:49]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s6, v[6:7], v[48:49]
 ; GFX10-NEXT:    s_waitcnt vmcnt(21)
-; GFX10-NEXT:    v_cmp_u_f64_e64 s9, v[12:13], v[52:53]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s10, v[14:15], v[52:53]
 ; GFX10-NEXT:    s_waitcnt vmcnt(19)
-; GFX10-NEXT:    v_cmp_u_f64_e64 s7, v[10:11], v[54:55]
-; GFX10-NEXT:    s_waitcnt vmcnt(18)
-; GFX10-NEXT:    v_max_f64 v[34:35], v[8:9], v[37:38]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s6, v[8:9], v[37:38]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s9, v[12:13], v[54:55]
+; GFX10-NEXT:    s_waitcnt vmcnt(17)
+; GFX10-NEXT:    v_cmp_u_f64_e64 s8, v[10:11], v[64:65]
 ; GFX10-NEXT:    s_waitcnt vmcnt(16)
-; GFX10-NEXT:    v_max_f64 v[8:9], v[0:1], v[64:65]
-; GFX10-NEXT:    v_max_f64 v[36:37], v[10:11], v[54:55]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s8, v[0:1], v[64:65]
-; GFX10-NEXT:    v_max_f64 v[38:39], v[12:13], v[52:53]
-; GFX10-NEXT:    v_max_f64 v[52:53], v[14:15], v[50:51]
+; GFX10-NEXT:    v_max_f64 v[48:49], v[8:9], v[37:38]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s7, v[8:9], v[37:38]
+; GFX10-NEXT:    v_max_f64 v[36:37], v[10:11], v[64:65]
+; GFX10-NEXT:    v_max_f64 v[38:39], v[12:13], v[54:55]
+; GFX10-NEXT:    v_max_f64 v[54:55], v[14:15], v[52:53]
 ; GFX10-NEXT:    s_waitcnt vmcnt(11)
-; GFX10-NEXT:    v_max_f64 v[54:55], v[20:21], v[70:71]
+; GFX10-NEXT:    v_max_f64 v[64:65], v[20:21], v[70:71]
 ; GFX10-NEXT:    v_cmp_u_f64_e64 s13, v[20:21], v[70:71]
 ; GFX10-NEXT:    s_waitcnt vmcnt(9)
 ; GFX10-NEXT:    v_cmp_u_f64_e64 s12, v[18:19], v[80:81]
 ; GFX10-NEXT:    s_waitcnt vmcnt(8)
-; GFX10-NEXT:    v_max_f64 v[50:51], v[16:17], v[48:49]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s11, v[16:17], v[48:49]
-; GFX10-NEXT:    v_max_f64 v[48:49], v[18:19], v[80:81]
-; GFX10-NEXT:    v_max_f64 v[64:65], v[22:23], v[68:69]
+; GFX10-NEXT:    v_max_f64 v[52:53], v[16:17], v[50:51]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s11, v[16:17], v[50:51]
+; GFX10-NEXT:    v_max_f64 v[50:51], v[18:19], v[80:81]
+; GFX10-NEXT:    v_max_f64 v[70:71], v[22:23], v[68:69]
 ; GFX10-NEXT:    v_cmp_u_f64_e64 s14, v[22:23], v[68:69]
-; GFX10-NEXT:    s_waitcnt vmcnt(7)
-; GFX10-NEXT:    v_max_f64 v[68:69], v[24:25], v[66:67]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s15, v[24:25], v[66:67]
-; GFX10-NEXT:    v_cndmask_b32_e64 v10, v36, 0, s7
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v8, 0, s8
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v9, 0x7ff80000, s8
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v34, 0, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, v35, 0x7ff80000, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v11, v37, 0x7ff80000, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v34, 0, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v35, 0x7ff80000, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, v48, 0, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v9, v49, 0x7ff80000, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v10, v36, 0, s8
+; GFX10-NEXT:    v_cndmask_b32_e64 v11, v37, 0x7ff80000, s8
 ; GFX10-NEXT:    v_cndmask_b32_e64 v12, v38, 0, s9
 ; GFX10-NEXT:    v_cndmask_b32_e64 v13, v39, 0x7ff80000, s9
-; GFX10-NEXT:    v_cndmask_b32_e64 v14, v52, 0, s10
-; GFX10-NEXT:    v_cndmask_b32_e64 v15, v53, 0x7ff80000, s10
-; GFX10-NEXT:    v_cndmask_b32_e64 v16, v50, 0, s11
-; GFX10-NEXT:    v_cndmask_b32_e64 v17, v51, 0x7ff80000, s11
-; GFX10-NEXT:    v_cndmask_b32_e64 v18, v48, 0, s12
-; GFX10-NEXT:    v_cndmask_b32_e64 v19, v49, 0x7ff80000, s12
-; GFX10-NEXT:    v_cndmask_b32_e64 v20, v54, 0, s13
-; GFX10-NEXT:    v_cndmask_b32_e64 v21, v55, 0x7ff80000, s13
-; GFX10-NEXT:    v_cndmask_b32_e64 v22, v64, 0, s14
-; GFX10-NEXT:    v_cndmask_b32_e64 v23, v65, 0x7ff80000, s14
-; GFX10-NEXT:    v_cndmask_b32_e64 v24, v68, 0, s15
-; GFX10-NEXT:    v_cndmask_b32_e64 v25, v69, 0x7ff80000, s15
+; GFX10-NEXT:    v_cndmask_b32_e64 v14, v54, 0, s10
+; GFX10-NEXT:    v_cndmask_b32_e64 v15, v55, 0x7ff80000, s10
+; GFX10-NEXT:    v_cndmask_b32_e64 v16, v52, 0, s11
+; GFX10-NEXT:    v_cndmask_b32_e64 v17, v53, 0x7ff80000, s11
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v50, 0, s12
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v51, 0x7ff80000, s12
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v64, 0, s13
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v65, 0x7ff80000, s13
+; GFX10-NEXT:    v_cndmask_b32_e64 v22, v70, 0, s14
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v71, 0x7ff80000, s14
+; GFX10-NEXT:    s_waitcnt vmcnt(6)
+; GFX10-NEXT:    v_max_f64 v[68:69], v[24:25], v[66:67]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s15, v[24:25], v[66:67]
 ; GFX10-NEXT:    s_waitcnt vmcnt(5)
-; GFX10-NEXT:    v_max_f64 v[70:71], v[28:29], v[2:3]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s17, v[28:29], v[2:3]
+; GFX10-NEXT:    v_max_f64 v[66:67], v[26:27], v[0:1]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s16, v[26:27], v[0:1]
 ; GFX10-NEXT:    s_waitcnt vmcnt(3)
-; GFX10-NEXT:    v_max_f64 v[66:67], v[26:27], v[4:5]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s16, v[26:27], v[4:5]
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v82, 0, vcc_lo
+; GFX10-NEXT:    v_max_f64 v[80:81], v[28:29], v[2:3]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s17, v[28:29], v[2:3]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_max_f64 v[80:81], v[30:31], v[6:7]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s18, v[30:31], v[6:7]
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v83, 0x7ff80000, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, v84, 0, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v85, 0x7ff80000, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v32, 0, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v7, v33, 0x7ff80000, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v28, v70, 0, s17
-; GFX10-NEXT:    v_cndmask_b32_e64 v29, v71, 0x7ff80000, s17
+; GFX10-NEXT:    v_max_f64 v[86:87], v[30:31], v[4:5]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s18, v[30:31], v[4:5]
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v82, 0, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v83, 0x7ff80000, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v84, 0, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v85, 0x7ff80000, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v32, 0, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v33, 0x7ff80000, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v68, 0, s15
+; GFX10-NEXT:    v_cndmask_b32_e64 v25, v69, 0x7ff80000, s15
 ; GFX10-NEXT:    v_cndmask_b32_e64 v26, v66, 0, s16
 ; GFX10-NEXT:    v_cndmask_b32_e64 v27, v67, 0x7ff80000, s16
-; GFX10-NEXT:    v_cndmask_b32_e64 v30, v80, 0, s18
-; GFX10-NEXT:    v_cndmask_b32_e64 v31, v81, 0x7ff80000, s18
+; GFX10-NEXT:    v_cndmask_b32_e64 v28, v80, 0, s17
+; GFX10-NEXT:    v_cndmask_b32_e64 v29, v81, 0x7ff80000, s17
+; GFX10-NEXT:    v_cndmask_b32_e64 v30, v86, 0, s18
+; GFX10-NEXT:    v_cndmask_b32_e64 v31, v87, 0x7ff80000, s18
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_maximum_v16f64:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll b/llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll
index f6d37b34807b1..6cc529e8195ea 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll
@@ -1874,6 +1874,7 @@ define <16 x half> @v_minimum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[10:11], v18, v17
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v17, 16, v9
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v18, 16, v1
+; GFX8-NEXT:    v_mov_b32_e32 v19, 0x7e00
 ; GFX8-NEXT:    v_min_f16_e32 v24, v18, v17
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[12:13], v18, v17
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v17, 16, v8
@@ -1888,28 +1889,26 @@ define <16 x half> @v_minimum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[20:21], v4, v12
 ; GFX8-NEXT:    v_min_f16_e32 v4, v3, v11
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[22:23], v3, v11
+; GFX8-NEXT:    v_min_f16_e32 v3, v2, v10
 ; GFX8-NEXT:    v_min_f16_e32 v11, v7, v15
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[24:25], v7, v15
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v12, 16, v15
 ; GFX8-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
-; GFX8-NEXT:    v_mov_b32_e32 v19, 0x7e00
+; GFX8-NEXT:    v_cndmask_b32_e32 v14, v19, v16, vcc
+; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v2, v10
 ; GFX8-NEXT:    v_min_f16_e32 v13, v7, v12
 ; GFX8-NEXT:    v_cmp_o_f16_e64 s[26:27], v7, v12
-; GFX8-NEXT:    v_min_f16_e32 v3, v2, v10
-; GFX8-NEXT:    v_cndmask_b32_e64 v12, v19, v13, s[26:27]
-; GFX8-NEXT:    v_cndmask_b32_e32 v13, v19, v16, vcc
-; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v2, v10
-; GFX8-NEXT:    v_min_f16_e32 v14, v1, v9
+; GFX8-NEXT:    v_min_f16_e32 v7, v1, v9
 ; GFX8-NEXT:    v_cndmask_b32_e32 v2, v19, v3, vcc
 ; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v1, v9
-; GFX8-NEXT:    v_min_f16_e32 v7, v0, v8
+; GFX8-NEXT:    v_min_f16_e32 v12, v0, v8
 ; GFX8-NEXT:    v_cndmask_b32_e64 v18, v19, v22, s[8:9]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v22, v19, v25, s[14:15]
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v19, v14, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v1, v19, v7, vcc
 ; GFX8-NEXT:    v_cmp_o_f16_e32 vcc, v0, v8
 ; GFX8-NEXT:    v_cndmask_b32_e64 v16, v19, v21, s[6:7]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v21, v19, v24, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e32 v0, v19, v7, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v0, v19, v12, vcc
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v3, 16, v22
 ; GFX8-NEXT:    v_cndmask_b32_e64 v15, v19, v20, s[4:5]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v20, v19, v23, s[10:11]
@@ -1923,14 +1922,15 @@ define <16 x half> @v_minimum_v16f16(<16 x half> %src0, <16 x half> %src1) {
 ; GFX8-NEXT:    v_cndmask_b32_e64 v5, v19, v5, s[20:21]
 ; GFX8-NEXT:    v_or_b32_sdwa v3, v4, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v4, 16, v16
+; GFX8-NEXT:    v_cndmask_b32_e64 v13, v19, v13, s[26:27]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v6, v19, v6, s[18:19]
 ; GFX8-NEXT:    v_or_b32_sdwa v4, v5, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_lshlrev_b32_e32 v5, 16, v15
 ; GFX8-NEXT:    v_cndmask_b32_e64 v11, v19, v11, s[24:25]
 ; GFX8-NEXT:    v_cndmask_b32_e64 v17, v19, v17, s[16:17]
 ; GFX8-NEXT:    v_or_b32_sdwa v5, v6, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_lshlrev_b32_e32 v6, 16, v13
-; GFX8-NEXT:    v_lshlrev_b32_e32 v7, 16, v12
+; GFX8-NEXT:    v_lshlrev_b32_e32 v6, 16, v14
+; GFX8-NEXT:    v_lshlrev_b32_e32 v7, 16, v13
 ; GFX8-NEXT:    v_or_b32_sdwa v6, v17, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_or_b32_sdwa v7, v11, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll b/llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll
index 7b2998cbd242f..0215795467323 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll
@@ -1684,7 +1684,7 @@ define <8 x float> @v_minimum_v8f32(<8 x float> %src0, <8 x float> %src1) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-NEXT:    v_cndmask_b32_e32 v0, 0x7fc00000, v16, vcc_lo
 ; GFX11-NEXT:    v_cmp_o_f32_e32 vcc_lo, v1, v9
-; GFX11-NEXT:    v_dual_min_f32 v9, v3, v11 :: v_dual_min_f32 v8, v2, v10
+; GFX11-NEXT:    v_dual_min_f32 v8, v2, v10 :: v_dual_min_f32 v9, v3, v11
 ; GFX11-NEXT:    v_cndmask_b32_e32 v1, 0x7fc00000, v17, vcc_lo
 ; GFX11-NEXT:    v_cmp_o_f32_e32 vcc_lo, v2, v10
 ; GFX11-NEXT:    v_min_f32_e32 v10, v7, v15
@@ -1727,169 +1727,169 @@ define <16 x float> @v_minimum_v16f32(<16 x float> %src0, <16 x float> %src1) {
 ; GFX7-LABEL: v_minimum_v16f32:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX7-NEXT:    v_min_f32_e32 v0, v0, v16
+; GFX7-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v17
 ; GFX7-NEXT:    v_min_f32_e32 v1, v1, v17
-; GFX7-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v18
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v18
 ; GFX7-NEXT:    v_min_f32_e32 v2, v2, v18
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v19
+; GFX7-NEXT:    v_mov_b32_e32 v17, 0x7fc00000
+; GFX7-NEXT:    v_min_f32_e32 v18, v13, v29
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v29
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v19
 ; GFX7-NEXT:    v_min_f32_e32 v3, v3, v19
-; GFX7-NEXT:    v_mov_b32_e32 v18, 0x7fc00000
-; GFX7-NEXT:    v_min_f32_e32 v19, v0, v16
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[28:29], v0, v16
-; GFX7-NEXT:    v_min_f32_e32 v16, v14, v30
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v20
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v20
 ; GFX7-NEXT:    v_min_f32_e32 v4, v4, v20
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v21
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v21
 ; GFX7-NEXT:    v_min_f32_e32 v5, v5, v21
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v22
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v22
 ; GFX7-NEXT:    v_min_f32_e32 v6, v6, v22
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v23
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v23
 ; GFX7-NEXT:    v_min_f32_e32 v7, v7, v23
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v24
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v24
 ; GFX7-NEXT:    v_min_f32_e32 v8, v8, v24
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v25
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v25
 ; GFX7-NEXT:    v_min_f32_e32 v9, v9, v25
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v26
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v26
 ; GFX7-NEXT:    v_min_f32_e32 v10, v10, v26
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v27
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v27
 ; GFX7-NEXT:    v_min_f32_e32 v11, v11, v27
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v28
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[26:27], v12, v28
 ; GFX7-NEXT:    v_min_f32_e32 v12, v12, v28
-; GFX7-NEXT:    v_cmp_o_f32_e64 s[26:27], v13, v29
-; GFX7-NEXT:    v_min_f32_e32 v13, v13, v29
-; GFX7-NEXT:    v_cndmask_b32_e32 v1, v18, v1, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v14, v18, v16, s[40:41]
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v18, v19, s[28:29]
-; GFX7-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v3, v18, v3, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v4, v18, v4, s[8:9]
-; GFX7-NEXT:    v_cndmask_b32_e64 v5, v18, v5, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v6, v18, v6, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v7, v18, v7, s[14:15]
-; GFX7-NEXT:    v_cndmask_b32_e64 v8, v18, v8, s[16:17]
-; GFX7-NEXT:    v_cndmask_b32_e64 v9, v18, v9, s[18:19]
-; GFX7-NEXT:    v_cndmask_b32_e64 v10, v18, v10, s[20:21]
-; GFX7-NEXT:    v_cndmask_b32_e64 v11, v18, v11, s[22:23]
-; GFX7-NEXT:    v_cndmask_b32_e64 v12, v18, v12, s[24:25]
-; GFX7-NEXT:    v_cndmask_b32_e64 v13, v18, v13, s[26:27]
+; GFX7-NEXT:    v_min_f32_e32 v19, v14, v30
+; GFX7-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
+; GFX7-NEXT:    v_cndmask_b32_e32 v0, v17, v0, vcc
+; GFX7-NEXT:    v_cndmask_b32_e64 v13, v17, v18, s[28:29]
+; GFX7-NEXT:    v_cndmask_b32_e64 v1, v17, v1, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v2, v17, v2, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v3, v17, v3, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v4, v17, v4, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v5, v17, v5, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, v17, v6, s[14:15]
+; GFX7-NEXT:    v_cndmask_b32_e64 v7, v17, v7, s[16:17]
+; GFX7-NEXT:    v_cndmask_b32_e64 v8, v17, v8, s[18:19]
+; GFX7-NEXT:    v_cndmask_b32_e64 v9, v17, v9, s[20:21]
+; GFX7-NEXT:    v_cndmask_b32_e64 v10, v17, v10, s[22:23]
+; GFX7-NEXT:    v_cndmask_b32_e64 v11, v17, v11, s[24:25]
+; GFX7-NEXT:    v_cndmask_b32_e64 v12, v17, v12, s[26:27]
+; GFX7-NEXT:    v_cndmask_b32_e64 v14, v17, v19, s[40:41]
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_min_f32_e32 v16, v15, v17
-; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
-; GFX7-NEXT:    v_cndmask_b32_e32 v15, v18, v16, vcc
+; GFX7-NEXT:    v_min_f32_e32 v18, v15, v16
+; GFX7-NEXT:    v_cmp_o_f32_e32 vcc, v15, v16
+; GFX7-NEXT:    v_cndmask_b32_e32 v15, v17, v18, vcc
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_minimum_v16f32:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
+; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX8-NEXT:    v_min_f32_e32 v0, v0, v16
+; GFX8-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v17
 ; GFX8-NEXT:    v_min_f32_e32 v1, v1, v17
-; GFX8-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v18
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v18
 ; GFX8-NEXT:    v_min_f32_e32 v2, v2, v18
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v19
+; GFX8-NEXT:    v_mov_b32_e32 v17, 0x7fc00000
+; GFX8-NEXT:    v_min_f32_e32 v18, v13, v29
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v29
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v19
 ; GFX8-NEXT:    v_min_f32_e32 v3, v3, v19
-; GFX8-NEXT:    v_mov_b32_e32 v18, 0x7fc00000
-; GFX8-NEXT:    v_min_f32_e32 v19, v0, v16
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[28:29], v0, v16
-; GFX8-NEXT:    v_min_f32_e32 v16, v14, v30
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v20
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v20
 ; GFX8-NEXT:    v_min_f32_e32 v4, v4, v20
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v21
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v21
 ; GFX8-NEXT:    v_min_f32_e32 v5, v5, v21
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v22
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v22
 ; GFX8-NEXT:    v_min_f32_e32 v6, v6, v22
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v23
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v23
 ; GFX8-NEXT:    v_min_f32_e32 v7, v7, v23
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v24
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v24
 ; GFX8-NEXT:    v_min_f32_e32 v8, v8, v24
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v25
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v25
 ; GFX8-NEXT:    v_min_f32_e32 v9, v9, v25
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v26
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v26
 ; GFX8-NEXT:    v_min_f32_e32 v10, v10, v26
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v27
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v27
 ; GFX8-NEXT:    v_min_f32_e32 v11, v11, v27
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v28
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[26:27], v12, v28
 ; GFX8-NEXT:    v_min_f32_e32 v12, v12, v28
-; GFX8-NEXT:    v_cmp_o_f32_e64 s[26:27], v13, v29
-; GFX8-NEXT:    v_min_f32_e32 v13, v13, v29
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v18, v1, vcc
-; GFX8-NEXT:    v_cndmask_b32_e64 v14, v18, v16, s[40:41]
-; GFX8-NEXT:    v_cndmask_b32_e64 v0, v18, v19, s[28:29]
-; GFX8-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v3, v18, v3, s[6:7]
-; GFX8-NEXT:    v_cndmask_b32_e64 v4, v18, v4, s[8:9]
-; GFX8-NEXT:    v_cndmask_b32_e64 v5, v18, v5, s[10:11]
-; GFX8-NEXT:    v_cndmask_b32_e64 v6, v18, v6, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e64 v7, v18, v7, s[14:15]
-; GFX8-NEXT:    v_cndmask_b32_e64 v8, v18, v8, s[16:17]
-; GFX8-NEXT:    v_cndmask_b32_e64 v9, v18, v9, s[18:19]
-; GFX8-NEXT:    v_cndmask_b32_e64 v10, v18, v10, s[20:21]
-; GFX8-NEXT:    v_cndmask_b32_e64 v11, v18, v11, s[22:23]
-; GFX8-NEXT:    v_cndmask_b32_e64 v12, v18, v12, s[24:25]
-; GFX8-NEXT:    v_cndmask_b32_e64 v13, v18, v13, s[26:27]
+; GFX8-NEXT:    v_min_f32_e32 v19, v14, v30
+; GFX8-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
+; GFX8-NEXT:    v_cndmask_b32_e32 v0, v17, v0, vcc
+; GFX8-NEXT:    v_cndmask_b32_e64 v13, v17, v18, s[28:29]
+; GFX8-NEXT:    v_cndmask_b32_e64 v1, v17, v1, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v2, v17, v2, s[6:7]
+; GFX8-NEXT:    v_cndmask_b32_e64 v3, v17, v3, s[8:9]
+; GFX8-NEXT:    v_cndmask_b32_e64 v4, v17, v4, s[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v5, v17, v5, s[12:13]
+; GFX8-NEXT:    v_cndmask_b32_e64 v6, v17, v6, s[14:15]
+; GFX8-NEXT:    v_cndmask_b32_e64 v7, v17, v7, s[16:17]
+; GFX8-NEXT:    v_cndmask_b32_e64 v8, v17, v8, s[18:19]
+; GFX8-NEXT:    v_cndmask_b32_e64 v9, v17, v9, s[20:21]
+; GFX8-NEXT:    v_cndmask_b32_e64 v10, v17, v10, s[22:23]
+; GFX8-NEXT:    v_cndmask_b32_e64 v11, v17, v11, s[24:25]
+; GFX8-NEXT:    v_cndmask_b32_e64 v12, v17, v12, s[26:27]
+; GFX8-NEXT:    v_cndmask_b32_e64 v14, v17, v19, s[40:41]
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_min_f32_e32 v16, v15, v17
-; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
-; GFX8-NEXT:    v_cndmask_b32_e32 v15, v18, v16, vcc
+; GFX8-NEXT:    v_min_f32_e32 v18, v15, v16
+; GFX8-NEXT:    v_cmp_o_f32_e32 vcc, v15, v16
+; GFX8-NEXT:    v_cndmask_b32_e32 v15, v17, v18, vcc
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX900-LABEL: v_minimum_v16f32:
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v1, v17
+; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v0, v16
+; GFX900-NEXT:    v_min_f32_e32 v0, v0, v16
+; GFX900-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[4:5], v1, v17
 ; GFX900-NEXT:    v_min_f32_e32 v1, v1, v17
-; GFX900-NEXT:    buffer_load_dword v17, off, s[0:3], s32
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[4:5], v2, v18
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[6:7], v2, v18
 ; GFX900-NEXT:    v_min_f32_e32 v2, v2, v18
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[6:7], v3, v19
+; GFX900-NEXT:    v_mov_b32_e32 v17, 0x7fc00000
+; GFX900-NEXT:    v_min_f32_e32 v18, v13, v29
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[28:29], v13, v29
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[8:9], v3, v19
 ; GFX900-NEXT:    v_min_f32_e32 v3, v3, v19
-; GFX900-NEXT:    v_mov_b32_e32 v18, 0x7fc00000
-; GFX900-NEXT:    v_min_f32_e32 v19, v0, v16
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[28:29], v0, v16
-; GFX900-NEXT:    v_min_f32_e32 v16, v14, v30
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[8:9], v4, v20
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[10:11], v4, v20
 ; GFX900-NEXT:    v_min_f32_e32 v4, v4, v20
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[10:11], v5, v21
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[12:13], v5, v21
 ; GFX900-NEXT:    v_min_f32_e32 v5, v5, v21
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[12:13], v6, v22
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[14:15], v6, v22
 ; GFX900-NEXT:    v_min_f32_e32 v6, v6, v22
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[14:15], v7, v23
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[16:17], v7, v23
 ; GFX900-NEXT:    v_min_f32_e32 v7, v7, v23
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[16:17], v8, v24
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[18:19], v8, v24
 ; GFX900-NEXT:    v_min_f32_e32 v8, v8, v24
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[18:19], v9, v25
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[20:21], v9, v25
 ; GFX900-NEXT:    v_min_f32_e32 v9, v9, v25
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[20:21], v10, v26
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[22:23], v10, v26
 ; GFX900-NEXT:    v_min_f32_e32 v10, v10, v26
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[22:23], v11, v27
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[24:25], v11, v27
 ; GFX900-NEXT:    v_min_f32_e32 v11, v11, v27
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[24:25], v12, v28
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[26:27], v12, v28
 ; GFX900-NEXT:    v_min_f32_e32 v12, v12, v28
-; GFX900-NEXT:    v_cmp_o_f32_e64 s[26:27], v13, v29
-; GFX900-NEXT:    v_min_f32_e32 v13, v13, v29
-; GFX900-NEXT:    v_cndmask_b32_e32 v1, v18, v1, vcc
-; GFX900-NEXT:    v_cndmask_b32_e64 v14, v18, v16, s[40:41]
-; GFX900-NEXT:    v_cndmask_b32_e64 v0, v18, v19, s[28:29]
-; GFX900-NEXT:    v_cndmask_b32_e64 v2, v18, v2, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v3, v18, v3, s[6:7]
-; GFX900-NEXT:    v_cndmask_b32_e64 v4, v18, v4, s[8:9]
-; GFX900-NEXT:    v_cndmask_b32_e64 v5, v18, v5, s[10:11]
-; GFX900-NEXT:    v_cndmask_b32_e64 v6, v18, v6, s[12:13]
-; GFX900-NEXT:    v_cndmask_b32_e64 v7, v18, v7, s[14:15]
-; GFX900-NEXT:    v_cndmask_b32_e64 v8, v18, v8, s[16:17]
-; GFX900-NEXT:    v_cndmask_b32_e64 v9, v18, v9, s[18:19]
-; GFX900-NEXT:    v_cndmask_b32_e64 v10, v18, v10, s[20:21]
-; GFX900-NEXT:    v_cndmask_b32_e64 v11, v18, v11, s[22:23]
-; GFX900-NEXT:    v_cndmask_b32_e64 v12, v18, v12, s[24:25]
-; GFX900-NEXT:    v_cndmask_b32_e64 v13, v18, v13, s[26:27]
+; GFX900-NEXT:    v_min_f32_e32 v19, v14, v30
+; GFX900-NEXT:    v_cmp_o_f32_e64 s[40:41], v14, v30
+; GFX900-NEXT:    v_cndmask_b32_e32 v0, v17, v0, vcc
+; GFX900-NEXT:    v_cndmask_b32_e64 v13, v17, v18, s[28:29]
+; GFX900-NEXT:    v_cndmask_b32_e64 v1, v17, v1, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v2, v17, v2, s[6:7]
+; GFX900-NEXT:    v_cndmask_b32_e64 v3, v17, v3, s[8:9]
+; GFX900-NEXT:    v_cndmask_b32_e64 v4, v17, v4, s[10:11]
+; GFX900-NEXT:    v_cndmask_b32_e64 v5, v17, v5, s[12:13]
+; GFX900-NEXT:    v_cndmask_b32_e64 v6, v17, v6, s[14:15]
+; GFX900-NEXT:    v_cndmask_b32_e64 v7, v17, v7, s[16:17]
+; GFX900-NEXT:    v_cndmask_b32_e64 v8, v17, v8, s[18:19]
+; GFX900-NEXT:    v_cndmask_b32_e64 v9, v17, v9, s[20:21]
+; GFX900-NEXT:    v_cndmask_b32_e64 v10, v17, v10, s[22:23]
+; GFX900-NEXT:    v_cndmask_b32_e64 v11, v17, v11, s[24:25]
+; GFX900-NEXT:    v_cndmask_b32_e64 v12, v17, v12, s[26:27]
+; GFX900-NEXT:    v_cndmask_b32_e64 v14, v17, v19, s[40:41]
 ; GFX900-NEXT:    s_waitcnt vmcnt(0)
-; GFX900-NEXT:    v_min_f32_e32 v16, v15, v17
-; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v15, v17
-; GFX900-NEXT:    v_cndmask_b32_e32 v15, v18, v16, vcc
+; GFX900-NEXT:    v_min_f32_e32 v18, v15, v16
+; GFX900-NEXT:    v_cmp_o_f32_e32 vcc, v15, v16
+; GFX900-NEXT:    v_cndmask_b32_e32 v15, v17, v18, vcc
 ; GFX900-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX950-LABEL: v_minimum_v16f32:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll b/llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll
index 1d1673315f6ff..dfd67873c3b86 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll
@@ -820,18 +820,18 @@ define void @s_minimum_v2f64(<2 x double> inreg %src0, <2 x double> inreg %src1)
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    v_mov_b32_e32 v0, s22
-; GFX7-NEXT:    v_mov_b32_e32 v4, s20
 ; GFX7-NEXT:    v_mov_b32_e32 v1, s23
-; GFX7-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX7-NEXT:    v_min_f64 v[2:3], s[18:19], v[0:1]
 ; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, s[18:19], v[0:1]
-; GFX7-NEXT:    v_min_f64 v[0:1], s[16:17], v[4:5]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[4:5]
+; GFX7-NEXT:    v_mov_b32_e32 v0, s20
+; GFX7-NEXT:    v_mov_b32_e32 v1, s21
+; GFX7-NEXT:    v_min_f64 v[4:5], s[16:17], v[0:1]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[0:1]
 ; GFX7-NEXT:    v_mov_b32_e32 v6, 0x7ff80000
 ; GFX7-NEXT:    v_cndmask_b32_e32 v3, v3, v6, vcc
 ; GFX7-NEXT:    v_cndmask_b32_e64 v2, v2, 0, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v0, 0, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v1, v5, v6, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v0, v4, 0, s[4:5]
 ; GFX7-NEXT:    ;;#ASMSTART
 ; GFX7-NEXT:    ; use v[0:3]
 ; GFX7-NEXT:    ;;#ASMEND
@@ -841,18 +841,18 @@ define void @s_minimum_v2f64(<2 x double> inreg %src0, <2 x double> inreg %src1)
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    v_mov_b32_e32 v0, s22
-; GFX8-NEXT:    v_mov_b32_e32 v4, s20
 ; GFX8-NEXT:    v_mov_b32_e32 v1, s23
-; GFX8-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX8-NEXT:    v_min_f64 v[2:3], s[18:19], v[0:1]
 ; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, s[18:19], v[0:1]
-; GFX8-NEXT:    v_min_f64 v[0:1], s[16:17], v[4:5]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[4:5]
+; GFX8-NEXT:    v_mov_b32_e32 v0, s20
+; GFX8-NEXT:    v_mov_b32_e32 v1, s21
+; GFX8-NEXT:    v_min_f64 v[4:5], s[16:17], v[0:1]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[0:1]
 ; GFX8-NEXT:    v_mov_b32_e32 v6, 0x7ff80000
 ; GFX8-NEXT:    v_cndmask_b32_e32 v3, v3, v6, vcc
 ; GFX8-NEXT:    v_cndmask_b32_e64 v2, v2, 0, vcc
-; GFX8-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v0, v0, 0, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v1, v5, v6, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v0, v4, 0, s[4:5]
 ; GFX8-NEXT:    ;;#ASMSTART
 ; GFX8-NEXT:    ; use v[0:3]
 ; GFX8-NEXT:    ;;#ASMEND
@@ -862,18 +862,18 @@ define void @s_minimum_v2f64(<2 x double> inreg %src0, <2 x double> inreg %src1)
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX900-NEXT:    v_mov_b32_e32 v0, s22
-; GFX900-NEXT:    v_mov_b32_e32 v4, s20
 ; GFX900-NEXT:    v_mov_b32_e32 v1, s23
-; GFX900-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX900-NEXT:    v_min_f64 v[2:3], s[18:19], v[0:1]
 ; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, s[18:19], v[0:1]
-; GFX900-NEXT:    v_min_f64 v[0:1], s[16:17], v[4:5]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[4:5]
+; GFX900-NEXT:    v_mov_b32_e32 v0, s20
+; GFX900-NEXT:    v_mov_b32_e32 v1, s21
+; GFX900-NEXT:    v_min_f64 v[4:5], s[16:17], v[0:1]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], s[16:17], v[0:1]
 ; GFX900-NEXT:    v_mov_b32_e32 v6, 0x7ff80000
 ; GFX900-NEXT:    v_cndmask_b32_e32 v3, v3, v6, vcc
 ; GFX900-NEXT:    v_cndmask_b32_e64 v2, v2, 0, vcc
-; GFX900-NEXT:    v_cndmask_b32_e64 v1, v1, v6, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v0, v0, 0, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v1, v5, v6, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v0, v4, 0, s[4:5]
 ; GFX900-NEXT:    ;;#ASMSTART
 ; GFX900-NEXT:    ; use v[0:3]
 ; GFX900-NEXT:    ;;#ASMEND
@@ -1743,120 +1743,120 @@ define <8 x double> @v_minimum_v8f64(<8 x double> %src0, <8 x double> %src1) {
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX7-NEXT:    v_min_f64 v[32:33], v[2:3], v[18:19]
-; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[18:19]
-; GFX7-NEXT:    v_min_f64 v[18:19], v[4:5], v[20:21]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], v[4:5], v[20:21]
-; GFX7-NEXT:    v_min_f64 v[2:3], v[0:1], v[16:17]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[8:9], v[0:1], v[16:17]
+; GFX7-NEXT:    v_min_f64 v[32:33], v[0:1], v[16:17]
+; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[16:17]
+; GFX7-NEXT:    v_min_f64 v[16:17], v[2:3], v[18:19]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[4:5], v[2:3], v[18:19]
 ; GFX7-NEXT:    v_mov_b32_e32 v34, 0x7ff80000
+; GFX7-NEXT:    v_min_f64 v[18:19], v[4:5], v[20:21]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[6:7], v[4:5], v[20:21]
 ; GFX7-NEXT:    v_min_f64 v[20:21], v[6:7], v[22:23]
-; GFX7-NEXT:    v_cmp_u_f64_e64 s[6:7], v[6:7], v[22:23]
-; GFX7-NEXT:    v_min_f64 v[16:17], v[8:9], v[24:25]
+; GFX7-NEXT:    v_cmp_u_f64_e64 s[8:9], v[6:7], v[22:23]
+; GFX7-NEXT:    v_min_f64 v[22:23], v[8:9], v[24:25]
 ; GFX7-NEXT:    v_cmp_u_f64_e64 s[10:11], v[8:9], v[24:25]
-; GFX7-NEXT:    v_min_f64 v[22:23], v[10:11], v[26:27]
+; GFX7-NEXT:    v_min_f64 v[24:25], v[10:11], v[26:27]
 ; GFX7-NEXT:    v_cmp_u_f64_e64 s[12:13], v[10:11], v[26:27]
-; GFX7-NEXT:    v_min_f64 v[24:25], v[12:13], v[28:29]
+; GFX7-NEXT:    v_min_f64 v[26:27], v[12:13], v[28:29]
 ; GFX7-NEXT:    v_cmp_u_f64_e64 s[14:15], v[12:13], v[28:29]
-; GFX7-NEXT:    v_cndmask_b32_e64 v0, v2, 0, s[8:9]
-; GFX7-NEXT:    v_cndmask_b32_e64 v1, v3, v34, s[8:9]
-; GFX7-NEXT:    v_cndmask_b32_e64 v2, v32, 0, vcc
-; GFX7-NEXT:    v_cndmask_b32_e32 v3, v33, v34, vcc
-; GFX7-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[4:5]
-; GFX7-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[6:7]
-; GFX7-NEXT:    v_cndmask_b32_e64 v8, v16, 0, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v9, v17, v34, s[10:11]
-; GFX7-NEXT:    v_cndmask_b32_e64 v10, v22, 0, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v11, v23, v34, s[12:13]
-; GFX7-NEXT:    v_cndmask_b32_e64 v12, v24, 0, s[14:15]
-; GFX7-NEXT:    v_cndmask_b32_e64 v13, v25, v34, s[14:15]
+; GFX7-NEXT:    v_cndmask_b32_e64 v0, v32, 0, vcc
+; GFX7-NEXT:    v_cndmask_b32_e32 v1, v33, v34, vcc
+; GFX7-NEXT:    v_cndmask_b32_e64 v2, v16, 0, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v3, v17, v34, s[4:5]
+; GFX7-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[6:7]
+; GFX7-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[8:9]
+; GFX7-NEXT:    v_cndmask_b32_e64 v8, v22, 0, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v9, v23, v34, s[10:11]
+; GFX7-NEXT:    v_cndmask_b32_e64 v10, v24, 0, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v11, v25, v34, s[12:13]
+; GFX7-NEXT:    v_cndmask_b32_e64 v12, v26, 0, s[14:15]
+; GFX7-NEXT:    v_cndmask_b32_e64 v13, v27, v34, s[14:15]
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_min_f64 v[18:19], v[14:15], v[30:31]
+; GFX7-NEXT:    v_min_f64 v[16:17], v[14:15], v[30:31]
 ; GFX7-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[30:31]
-; GFX7-NEXT:    v_cndmask_b32_e64 v14, v18, 0, vcc
-; GFX7-NEXT:    v_cndmask_b32_e32 v15, v19, v34, vcc
+; GFX7-NEXT:    v_cndmask_b32_e64 v14, v16, 0, vcc
+; GFX7-NEXT:    v_cndmask_b32_e32 v15, v17, v34, vcc
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_minimum_v8f64:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX8-NEXT:    v_min_f64 v[32:33], v[2:3], v[18:19]
-; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[18:19]
-; GFX8-NEXT:    v_min_f64 v[18:19], v[4:5], v[20:21]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], v[4:5], v[20:21]
-; GFX8-NEXT:    v_min_f64 v[2:3], v[0:1], v[16:17]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[8:9], v[0:1], v[16:17]
+; GFX8-NEXT:    v_min_f64 v[32:33], v[0:1], v[16:17]
+; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[16:17]
+; GFX8-NEXT:    v_min_f64 v[16:17], v[2:3], v[18:19]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[4:5], v[2:3], v[18:19]
 ; GFX8-NEXT:    v_mov_b32_e32 v34, 0x7ff80000
+; GFX8-NEXT:    v_min_f64 v[18:19], v[4:5], v[20:21]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[6:7], v[4:5], v[20:21]
 ; GFX8-NEXT:    v_min_f64 v[20:21], v[6:7], v[22:23]
-; GFX8-NEXT:    v_cmp_u_f64_e64 s[6:7], v[6:7], v[22:23]
-; GFX8-NEXT:    v_min_f64 v[16:17], v[8:9], v[24:25]
+; GFX8-NEXT:    v_cmp_u_f64_e64 s[8:9], v[6:7], v[22:23]
+; GFX8-NEXT:    v_min_f64 v[22:23], v[8:9], v[24:25]
 ; GFX8-NEXT:    v_cmp_u_f64_e64 s[10:11], v[8:9], v[24:25]
-; GFX8-NEXT:    v_min_f64 v[22:23], v[10:11], v[26:27]
+; GFX8-NEXT:    v_min_f64 v[24:25], v[10:11], v[26:27]
 ; GFX8-NEXT:    v_cmp_u_f64_e64 s[12:13], v[10:11], v[26:27]
-; GFX8-NEXT:    v_min_f64 v[24:25], v[12:13], v[28:29]
+; GFX8-NEXT:    v_min_f64 v[26:27], v[12:13], v[28:29]
 ; GFX8-NEXT:    v_cmp_u_f64_e64 s[14:15], v[12:13], v[28:29]
-; GFX8-NEXT:    v_cndmask_b32_e64 v0, v2, 0, s[8:9]
-; GFX8-NEXT:    v_cndmask_b32_e64 v1, v3, v34, s[8:9]
-; GFX8-NEXT:    v_cndmask_b32_e64 v2, v32, 0, vcc
-; GFX8-NEXT:    v_cndmask_b32_e32 v3, v33, v34, vcc
-; GFX8-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[4:5]
-; GFX8-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[6:7]
-; GFX8-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[6:7]
-; GFX8-NEXT:    v_cndmask_b32_e64 v8, v16, 0, s[10:11]
-; GFX8-NEXT:    v_cndmask_b32_e64 v9, v17, v34, s[10:11]
-; GFX8-NEXT:    v_cndmask_b32_e64 v10, v22, 0, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e64 v11, v23, v34, s[12:13]
-; GFX8-NEXT:    v_cndmask_b32_e64 v12, v24, 0, s[14:15]
-; GFX8-NEXT:    v_cndmask_b32_e64 v13, v25, v34, s[14:15]
+; GFX8-NEXT:    v_cndmask_b32_e64 v0, v32, 0, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v1, v33, v34, vcc
+; GFX8-NEXT:    v_cndmask_b32_e64 v2, v16, 0, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v3, v17, v34, s[4:5]
+; GFX8-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[6:7]
+; GFX8-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[6:7]
+; GFX8-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[8:9]
+; GFX8-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[8:9]
+; GFX8-NEXT:    v_cndmask_b32_e64 v8, v22, 0, s[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v9, v23, v34, s[10:11]
+; GFX8-NEXT:    v_cndmask_b32_e64 v10, v24, 0, s[12:13]
+; GFX8-NEXT:    v_cndmask_b32_e64 v11, v25, v34, s[12:13]
+; GFX8-NEXT:    v_cndmask_b32_e64 v12, v26, 0, s[14:15]
+; GFX8-NEXT:    v_cndmask_b32_e64 v13, v27, v34, s[14:15]
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_min_f64 v[18:19], v[14:15], v[30:31]
+; GFX8-NEXT:    v_min_f64 v[16:17], v[14:15], v[30:31]
 ; GFX8-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[30:31]
-; GFX8-NEXT:    v_cndmask_b32_e64 v14, v18, 0, vcc
-; GFX8-NEXT:    v_cndmask_b32_e32 v15, v19, v34, vcc
+; GFX8-NEXT:    v_cndmask_b32_e64 v14, v16, 0, vcc
+; GFX8-NEXT:    v_cndmask_b32_e32 v15, v17, v34, vcc
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX900-LABEL: v_minimum_v8f64:
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX900-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX900-NEXT:    v_min_f64 v[32:33], v[2:3], v[18:19]
-; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[18:19]
-; GFX900-NEXT:    v_min_f64 v[18:19], v[4:5], v[20:21]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], v[4:5], v[20:21]
-; GFX900-NEXT:    v_min_f64 v[2:3], v[0:1], v[16:17]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[8:9], v[0:1], v[16:17]
+; GFX900-NEXT:    v_min_f64 v[32:33], v[0:1], v[16:17]
+; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[16:17]
+; GFX900-NEXT:    v_min_f64 v[16:17], v[2:3], v[18:19]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[4:5], v[2:3], v[18:19]
 ; GFX900-NEXT:    v_mov_b32_e32 v34, 0x7ff80000
+; GFX900-NEXT:    v_min_f64 v[18:19], v[4:5], v[20:21]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[6:7], v[4:5], v[20:21]
 ; GFX900-NEXT:    v_min_f64 v[20:21], v[6:7], v[22:23]
-; GFX900-NEXT:    v_cmp_u_f64_e64 s[6:7], v[6:7], v[22:23]
-; GFX900-NEXT:    v_min_f64 v[16:17], v[8:9], v[24:25]
+; GFX900-NEXT:    v_cmp_u_f64_e64 s[8:9], v[6:7], v[22:23]
+; GFX900-NEXT:    v_min_f64 v[22:23], v[8:9], v[24:25]
 ; GFX900-NEXT:    v_cmp_u_f64_e64 s[10:11], v[8:9], v[24:25]
-; GFX900-NEXT:    v_min_f64 v[22:23], v[10:11], v[26:27]
+; GFX900-NEXT:    v_min_f64 v[24:25], v[10:11], v[26:27]
 ; GFX900-NEXT:    v_cmp_u_f64_e64 s[12:13], v[10:11], v[26:27]
-; GFX900-NEXT:    v_min_f64 v[24:25], v[12:13], v[28:29]
+; GFX900-NEXT:    v_min_f64 v[26:27], v[12:13], v[28:29]
 ; GFX900-NEXT:    v_cmp_u_f64_e64 s[14:15], v[12:13], v[28:29]
-; GFX900-NEXT:    v_cndmask_b32_e64 v0, v2, 0, s[8:9]
-; GFX900-NEXT:    v_cndmask_b32_e64 v1, v3, v34, s[8:9]
-; GFX900-NEXT:    v_cndmask_b32_e64 v2, v32, 0, vcc
-; GFX900-NEXT:    v_cndmask_b32_e32 v3, v33, v34, vcc
-; GFX900-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[4:5]
-; GFX900-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[6:7]
-; GFX900-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[6:7]
-; GFX900-NEXT:    v_cndmask_b32_e64 v8, v16, 0, s[10:11]
-; GFX900-NEXT:    v_cndmask_b32_e64 v9, v17, v34, s[10:11]
-; GFX900-NEXT:    v_cndmask_b32_e64 v10, v22, 0, s[12:13]
-; GFX900-NEXT:    v_cndmask_b32_e64 v11, v23, v34, s[12:13]
-; GFX900-NEXT:    v_cndmask_b32_e64 v12, v24, 0, s[14:15]
-; GFX900-NEXT:    v_cndmask_b32_e64 v13, v25, v34, s[14:15]
+; GFX900-NEXT:    v_cndmask_b32_e64 v0, v32, 0, vcc
+; GFX900-NEXT:    v_cndmask_b32_e32 v1, v33, v34, vcc
+; GFX900-NEXT:    v_cndmask_b32_e64 v2, v16, 0, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v3, v17, v34, s[4:5]
+; GFX900-NEXT:    v_cndmask_b32_e64 v4, v18, 0, s[6:7]
+; GFX900-NEXT:    v_cndmask_b32_e64 v5, v19, v34, s[6:7]
+; GFX900-NEXT:    v_cndmask_b32_e64 v6, v20, 0, s[8:9]
+; GFX900-NEXT:    v_cndmask_b32_e64 v7, v21, v34, s[8:9]
+; GFX900-NEXT:    v_cndmask_b32_e64 v8, v22, 0, s[10:11]
+; GFX900-NEXT:    v_cndmask_b32_e64 v9, v23, v34, s[10:11]
+; GFX900-NEXT:    v_cndmask_b32_e64 v10, v24, 0, s[12:13]
+; GFX900-NEXT:    v_cndmask_b32_e64 v11, v25, v34, s[12:13]
+; GFX900-NEXT:    v_cndmask_b32_e64 v12, v26, 0, s[14:15]
+; GFX900-NEXT:    v_cndmask_b32_e64 v13, v27, v34, s[14:15]
 ; GFX900-NEXT:    s_waitcnt vmcnt(0)
-; GFX900-NEXT:    v_min_f64 v[18:19], v[14:15], v[30:31]
+; GFX900-NEXT:    v_min_f64 v[16:17], v[14:15], v[30:31]
 ; GFX900-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[30:31]
-; GFX900-NEXT:    v_cndmask_b32_e64 v14, v18, 0, vcc
-; GFX900-NEXT:    v_cndmask_b32_e32 v15, v19, v34, vcc
+; GFX900-NEXT:    v_cndmask_b32_e64 v14, v16, 0, vcc
+; GFX900-NEXT:    v_cndmask_b32_e32 v15, v17, v34, vcc
 ; GFX900-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX950-LABEL: v_minimum_v8f64:
@@ -2365,24 +2365,24 @@ define <16 x double> @v_minimum_v16f64(<16 x double> %src0, <16 x double> %src1)
 ; GFX950-LABEL: v_minimum_v16f64:
 ; GFX950:       ; %bb.0:
 ; GFX950-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX950-NEXT:    v_accvgpr_write_b32 a1, v40 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a2, v41 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a3, v42 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a4, v43 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a5, v44 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a6, v45 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a7, v46 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a8, v47 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a9, v56 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a10, v57 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a0, v40 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a1, v41 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a2, v42 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a3, v43 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a4, v44 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a5, v45 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a6, v46 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a7, v47 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a8, v56 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a9, v57 ; Reload Reuse
+; GFX950-NEXT:    scratch_load_dword v33, off, s32 offset:8
+; GFX950-NEXT:    scratch_load_dword v32, off, s32 offset:4
 ; GFX950-NEXT:    scratch_load_dword v37, off, s32 offset:16
 ; GFX950-NEXT:    scratch_load_dword v36, off, s32 offset:12
 ; GFX950-NEXT:    scratch_load_dword v39, off, s32 offset:24
 ; GFX950-NEXT:    scratch_load_dword v38, off, s32 offset:20
-; GFX950-NEXT:    scratch_load_dword v49, off, s32 offset:32
-; GFX950-NEXT:    scratch_load_dword v48, off, s32 offset:28
-; GFX950-NEXT:    scratch_load_dword v57, off, s32 offset:8
-; GFX950-NEXT:    scratch_load_dword v56, off, s32 offset:4
+; GFX950-NEXT:    scratch_load_dword v57, off, s32 offset:32
+; GFX950-NEXT:    scratch_load_dword v56, off, s32 offset:28
 ; GFX950-NEXT:    scratch_load_dword v47, off, s32 offset:40
 ; GFX950-NEXT:    scratch_load_dword v46, off, s32 offset:36
 ; GFX950-NEXT:    scratch_load_dword v45, off, s32 offset:48
@@ -2397,148 +2397,149 @@ define <16 x double> @v_minimum_v16f64(<16 x double> %src0, <16 x double> %src1)
 ; GFX950-NEXT:    scratch_load_dword v52, off, s32 offset:76
 ; GFX950-NEXT:    scratch_load_dword v51, off, s32 offset:88
 ; GFX950-NEXT:    scratch_load_dword v50, off, s32 offset:84
-; GFX950-NEXT:    scratch_load_dword v35, off, s32 offset:96
-; GFX950-NEXT:    scratch_load_dword v34, off, s32 offset:92
+; GFX950-NEXT:    scratch_load_dword v49, off, s32 offset:96
+; GFX950-NEXT:    scratch_load_dword v48, off, s32 offset:92
 ; GFX950-NEXT:    scratch_load_dword v31, off, s32
-; GFX950-NEXT:    scratch_load_dword v33, off, s32 offset:104
-; GFX950-NEXT:    scratch_load_dword v32, off, s32 offset:100
-; GFX950-NEXT:    v_accvgpr_write_b32 a11, v58 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a12, v59 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a13, v60 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a14, v61 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a15, v62 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_write_b32 a16, v63 ; Reload Reuse
+; GFX950-NEXT:    scratch_load_dword v35, off, s32 offset:104
+; GFX950-NEXT:    scratch_load_dword v34, off, s32 offset:100
+; GFX950-NEXT:    v_accvgpr_write_b32 a10, v58 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a11, v59 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a12, v60 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a13, v61 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a14, v62 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_write_b32 a15, v63 ; Reload Reuse
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_min_f64 v[58:59], v[2:3], v[36:37]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[2:3], v[36:37]
-; GFX950-NEXT:    scratch_load_dword v37, off, s32 offset:112
-; GFX950-NEXT:    scratch_load_dword v36, off, s32 offset:108
+; GFX950-NEXT:    v_min_f64 v[58:59], v[0:1], v[32:33]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[0:1], v[32:33]
+; GFX950-NEXT:    scratch_load_dword v33, off, s32 offset:112
+; GFX950-NEXT:    scratch_load_dword v32, off, s32 offset:108
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_min_f64 v[60:61], v[4:5], v[38:39]
-; GFX950-NEXT:    v_cmp_u_f64_e64 s[0:1], v[4:5], v[38:39]
-; GFX950-NEXT:    scratch_load_dword v39, off, s32 offset:120
-; GFX950-NEXT:    scratch_load_dword v38, off, s32 offset:116
+; GFX950-NEXT:    v_min_f64 v[60:61], v[2:3], v[36:37]
+; GFX950-NEXT:    v_cmp_u_f64_e64 s[0:1], v[2:3], v[36:37]
+; GFX950-NEXT:    scratch_load_dword v37, off, s32 offset:120
+; GFX950-NEXT:    scratch_load_dword v36, off, s32 offset:116
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_min_f64 v[62:63], v[6:7], v[48:49]
-; GFX950-NEXT:    v_cmp_u_f64_e64 s[2:3], v[6:7], v[48:49]
-; GFX950-NEXT:    scratch_load_dword v49, off, s32 offset:128
-; GFX950-NEXT:    scratch_load_dword v48, off, s32 offset:124
+; GFX950-NEXT:    v_min_f64 v[62:63], v[4:5], v[38:39]
+; GFX950-NEXT:    v_cmp_u_f64_e64 s[2:3], v[4:5], v[38:39]
+; GFX950-NEXT:    scratch_load_dword v39, off, s32 offset:128
+; GFX950-NEXT:    scratch_load_dword v38, off, s32 offset:124
+; GFX950-NEXT:    v_mov_b32_e32 v2, 0x7ff80000
 ; GFX950-NEXT:    s_waitcnt vmcnt(25)
-; GFX950-NEXT:    v_min_f64 v[2:3], v[0:1], v[56:57]
-; GFX950-NEXT:    v_cmp_u_f64_e64 s[4:5], v[0:1], v[56:57]
-; GFX950-NEXT:    v_mov_b32_e32 v0, 0x7ff80000
+; GFX950-NEXT:    v_min_f64 v[0:1], v[6:7], v[56:57]
+; GFX950-NEXT:    v_cmp_u_f64_e64 s[4:5], v[6:7], v[56:57]
 ; GFX950-NEXT:    s_waitcnt vmcnt(23)
 ; GFX950-NEXT:    v_min_f64 v[56:57], v[8:9], v[46:47]
-; GFX950-NEXT:    v_cndmask_b32_e64 v1, v2, 0, s[4:5]
-; GFX950-NEXT:    v_accvgpr_write_b32 a0, v1
-; GFX950-NEXT:    v_cndmask_b32_e64 v1, v3, v0, s[4:5]
-; GFX950-NEXT:    v_cndmask_b32_e64 v2, v58, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v3, v59, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e64 v58, v58, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v59, v59, v2, vcc
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[8:9], v[46:47]
-; GFX950-NEXT:    s_waitcnt vmcnt(21)
-; GFX950-NEXT:    v_min_f64 v[46:47], v[10:11], v[44:45]
-; GFX950-NEXT:    v_cndmask_b32_e64 v4, v60, 0, s[0:1]
+; GFX950-NEXT:    v_cndmask_b32_e64 v6, v0, 0, s[4:5]
+; GFX950-NEXT:    v_cndmask_b32_e64 v7, v1, v2, s[4:5]
 ; GFX950-NEXT:    v_cndmask_b32_e64 v8, v56, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v9, v57, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v9, v57, v2, vcc
+; GFX950-NEXT:    s_waitcnt vmcnt(21)
+; GFX950-NEXT:    v_min_f64 v[0:1], v[10:11], v[44:45]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[10:11], v[44:45]
+; GFX950-NEXT:    v_cndmask_b32_e64 v60, v60, 0, s[0:1]
+; GFX950-NEXT:    v_cndmask_b32_e64 v3, v61, v2, s[0:1]
+; GFX950-NEXT:    v_cndmask_b32_e64 v10, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v11, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(19)
-; GFX950-NEXT:    v_min_f64 v[44:45], v[12:13], v[42:43]
-; GFX950-NEXT:    v_cndmask_b32_e64 v5, v61, v0, s[0:1]
-; GFX950-NEXT:    v_cndmask_b32_e64 v10, v46, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v11, v47, v0, vcc
+; GFX950-NEXT:    v_min_f64 v[0:1], v[12:13], v[42:43]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[12:13], v[42:43]
+; GFX950-NEXT:    v_cndmask_b32_e64 v4, v62, 0, s[2:3]
+; GFX950-NEXT:    v_cndmask_b32_e64 v5, v63, v2, s[2:3]
+; GFX950-NEXT:    v_cndmask_b32_e64 v12, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v13, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(17)
-; GFX950-NEXT:    v_min_f64 v[42:43], v[14:15], v[40:41]
-; GFX950-NEXT:    v_cndmask_b32_e64 v6, v62, 0, s[2:3]
-; GFX950-NEXT:    v_cndmask_b32_e64 v12, v44, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v13, v45, v0, vcc
+; GFX950-NEXT:    v_min_f64 v[0:1], v[14:15], v[40:41]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[14:15], v[40:41]
+; GFX950-NEXT:    v_accvgpr_read_b32 v63, a15 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v62, a14 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v14, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v15, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(15)
-; GFX950-NEXT:    v_min_f64 v[40:41], v[16:17], v[54:55]
-; GFX950-NEXT:    v_cndmask_b32_e64 v7, v63, v0, s[2:3]
-; GFX950-NEXT:    v_cndmask_b32_e64 v14, v42, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v15, v43, v0, vcc
+; GFX950-NEXT:    v_min_f64 v[0:1], v[16:17], v[54:55]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[16:17], v[54:55]
+; GFX950-NEXT:    v_accvgpr_read_b32 v61, a13 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v57, a9 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v16, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v17, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(13)
-; GFX950-NEXT:    v_min_f64 v[54:55], v[18:19], v[52:53]
-; GFX950-NEXT:    v_accvgpr_read_b32 v63, a16 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v16, v40, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v17, v41, v0, vcc
+; GFX950-NEXT:    v_min_f64 v[0:1], v[18:19], v[52:53]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[18:19], v[52:53]
+; GFX950-NEXT:    v_accvgpr_read_b32 v56, a8 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v47, a7 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v18, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v19, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(11)
-; GFX950-NEXT:    v_min_f64 v[52:53], v[20:21], v[50:51]
-; GFX950-NEXT:    v_accvgpr_read_b32 v62, a15 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v18, v54, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v19, v55, v0, vcc
+; GFX950-NEXT:    v_min_f64 v[0:1], v[20:21], v[50:51]
 ; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[20:21], v[50:51]
+; GFX950-NEXT:    v_accvgpr_read_b32 v46, a6 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v45, a5 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v20, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v21, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(9)
-; GFX950-NEXT:    v_min_f64 v[50:51], v[22:23], v[34:35]
-; GFX950-NEXT:    v_accvgpr_read_b32 v61, a14 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v20, v52, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v21, v53, v0, vcc
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[22:23], v[34:35]
+; GFX950-NEXT:    v_min_f64 v[0:1], v[22:23], v[48:49]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[22:23], v[48:49]
+; GFX950-NEXT:    v_accvgpr_read_b32 v44, a4 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v43, a3 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v22, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v23, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(6)
-; GFX950-NEXT:    v_min_f64 v[34:35], v[24:25], v[32:33]
-; GFX950-NEXT:    v_accvgpr_read_b32 v60, a13 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v22, v50, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v23, v51, v0, vcc
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[24:25], v[32:33]
-; GFX950-NEXT:    v_accvgpr_read_b32 v59, a12 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v58, a11 ; Reload Reuse
-; GFX950-NEXT:    v_cndmask_b32_e64 v24, v34, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v25, v35, v0, vcc
-; GFX950-NEXT:    v_accvgpr_read_b32 v57, a10 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v56, a9 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v47, a8 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v46, a7 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v45, a6 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v44, a5 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v43, a4 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v42, a3 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v41, a2 ; Reload Reuse
-; GFX950-NEXT:    v_accvgpr_read_b32 v40, a1 ; Reload Reuse
+; GFX950-NEXT:    v_min_f64 v[0:1], v[24:25], v[34:35]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[24:25], v[34:35]
+; GFX950-NEXT:    v_accvgpr_read_b32 v42, a2 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v41, a1 ; Reload Reuse
+; GFX950-NEXT:    v_cndmask_b32_e64 v24, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v25, v1, v2, vcc
+; GFX950-NEXT:    v_accvgpr_read_b32 v40, a0 ; Reload Reuse
 ; GFX950-NEXT:    s_waitcnt vmcnt(4)
-; GFX950-NEXT:    v_min_f64 v[32:33], v[26:27], v[36:37]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[26:27], v[36:37]
+; GFX950-NEXT:    v_min_f64 v[0:1], v[26:27], v[32:33]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[26:27], v[32:33]
 ; GFX950-NEXT:    s_nop 1
-; GFX950-NEXT:    v_cndmask_b32_e64 v26, v32, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v27, v33, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e64 v26, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v27, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(2)
-; GFX950-NEXT:    v_min_f64 v[32:33], v[28:29], v[38:39]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[28:29], v[38:39]
+; GFX950-NEXT:    v_min_f64 v[0:1], v[28:29], v[36:37]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[28:29], v[36:37]
 ; GFX950-NEXT:    s_nop 1
-; GFX950-NEXT:    v_cndmask_b32_e64 v28, v32, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v29, v33, v0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e64 v28, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v29, v1, v2, vcc
 ; GFX950-NEXT:    s_waitcnt vmcnt(0)
-; GFX950-NEXT:    v_min_f64 v[32:33], v[30:31], v[48:49]
-; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[30:31], v[48:49]
+; GFX950-NEXT:    v_min_f64 v[0:1], v[30:31], v[38:39]
+; GFX950-NEXT:    v_cmp_u_f64_e32 vcc, v[30:31], v[38:39]
 ; GFX950-NEXT:    s_nop 1
-; GFX950-NEXT:    v_cndmask_b32_e64 v30, v32, 0, vcc
-; GFX950-NEXT:    v_cndmask_b32_e32 v31, v33, v0, vcc
-; GFX950-NEXT:    v_accvgpr_read_b32 v0, a0
+; GFX950-NEXT:    v_cndmask_b32_e64 v30, v0, 0, vcc
+; GFX950-NEXT:    v_cndmask_b32_e32 v31, v1, v2, vcc
+; GFX950-NEXT:    v_mov_b32_e32 v0, v58
+; GFX950-NEXT:    v_mov_b32_e32 v1, v59
+; GFX950-NEXT:    v_mov_b32_e32 v2, v60
+; GFX950-NEXT:    v_accvgpr_read_b32 v60, a12 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v59, a11 ; Reload Reuse
+; GFX950-NEXT:    v_accvgpr_read_b32 v58, a10 ; Reload Reuse
 ; GFX950-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_minimum_v16f64:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    s_clause 0x19
-; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:16
-; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:12
-; GFX10-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:24
-; GFX10-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:20
-; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:32
-; GFX10-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:28
+; GFX10-NEXT:    s_clause 0x18
+; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
+; GFX10-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
+; GFX10-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:16
+; GFX10-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:12
+; GFX10-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:24
+; GFX10-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:20
 ; GFX10-NEXT:    buffer_load_dword v37, off, s[0:3], s32 offset:36
-; GFX10-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:68
-; GFX10-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:64
-; GFX10-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:60
-; GFX10-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:56
-; GFX10-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:52
-; GFX10-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:48
-; GFX10-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:44
+; GFX10-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:32
+; GFX10-NEXT:    buffer_load_dword v48, off, s[0:3], s32 offset:28
+; GFX10-NEXT:    buffer_load_dword v50, off, s[0:3], s32 offset:68
+; GFX10-NEXT:    buffer_load_dword v53, off, s[0:3], s32 offset:64
+; GFX10-NEXT:    buffer_load_dword v52, off, s[0:3], s32 offset:60
+; GFX10-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:56
+; GFX10-NEXT:    buffer_load_dword v54, off, s[0:3], s32 offset:52
+; GFX10-NEXT:    buffer_load_dword v65, off, s[0:3], s32 offset:48
+; GFX10-NEXT:    buffer_load_dword v64, off, s[0:3], s32 offset:44
 ; GFX10-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:40
-; GFX10-NEXT:    buffer_load_dword v65, off, s[0:3], s32 offset:8
-; GFX10-NEXT:    buffer_load_dword v64, off, s[0:3], s32 offset:4
 ; GFX10-NEXT:    buffer_load_dword v66, off, s[0:3], s32 offset:100
 ; GFX10-NEXT:    buffer_load_dword v69, off, s[0:3], s32 offset:96
 ; GFX10-NEXT:    buffer_load_dword v68, off, s[0:3], s32 offset:92
@@ -2546,96 +2547,95 @@ define <16 x double> @v_minimum_v16f64(<16 x double> %src0, <16 x double> %src1)
 ; GFX10-NEXT:    buffer_load_dword v70, off, s[0:3], s32 offset:84
 ; GFX10-NEXT:    buffer_load_dword v81, off, s[0:3], s32 offset:80
 ; GFX10-NEXT:    buffer_load_dword v80, off, s[0:3], s32 offset:76
-; GFX10-NEXT:    buffer_load_dword v49, off, s[0:3], s32 offset:72
+; GFX10-NEXT:    buffer_load_dword v51, off, s[0:3], s32 offset:72
+; GFX10-NEXT:    s_waitcnt vmcnt(23)
+; GFX10-NEXT:    v_min_f64 v[82:83], v[0:1], v[31:32]
+; GFX10-NEXT:    v_cmp_u_f64_e32 vcc_lo, v[0:1], v[31:32]
+; GFX10-NEXT:    s_waitcnt vmcnt(21)
+; GFX10-NEXT:    v_min_f64 v[84:85], v[2:3], v[33:34]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s4, v[2:3], v[33:34]
+; GFX10-NEXT:    s_waitcnt vmcnt(19)
+; GFX10-NEXT:    v_min_f64 v[32:33], v[4:5], v[35:36]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s5, v[4:5], v[35:36]
+; GFX10-NEXT:    s_clause 0x7
+; GFX10-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:112
 ; GFX10-NEXT:    buffer_load_dword v67, off, s[0:3], s32 offset:104
-; GFX10-NEXT:    s_waitcnt vmcnt(24)
-; GFX10-NEXT:    v_min_f64 v[82:83], v[2:3], v[31:32]
-; GFX10-NEXT:    v_cmp_u_f64_e32 vcc_lo, v[2:3], v[31:32]
-; GFX10-NEXT:    s_waitcnt vmcnt(22)
-; GFX10-NEXT:    v_min_f64 v[84:85], v[4:5], v[33:34]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s4, v[4:5], v[33:34]
-; GFX10-NEXT:    s_clause 0x3
+; GFX10-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:108
 ; GFX10-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:120
 ; GFX10-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:116
-; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:112
-; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:108
-; GFX10-NEXT:    s_waitcnt vmcnt(24)
-; GFX10-NEXT:    v_min_f64 v[32:33], v[6:7], v[35:36]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s5, v[6:7], v[35:36]
-; GFX10-NEXT:    s_clause 0x2
 ; GFX10-NEXT:    buffer_load_dword v31, off, s[0:3], s32
-; GFX10-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:128
-; GFX10-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:124
-; GFX10-NEXT:    s_waitcnt vmcnt(23)
-; GFX10-NEXT:    v_cmp_u_f64_e64 s10, v[14:15], v[50:51]
+; GFX10-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:128
+; GFX10-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:124
+; GFX10-NEXT:    s_waitcnt vmcnt(24)
+; GFX10-NEXT:    v_min_f64 v[34:35], v[6:7], v[48:49]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s6, v[6:7], v[48:49]
 ; GFX10-NEXT:    s_waitcnt vmcnt(21)
-; GFX10-NEXT:    v_cmp_u_f64_e64 s9, v[12:13], v[52:53]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s10, v[14:15], v[52:53]
 ; GFX10-NEXT:    s_waitcnt vmcnt(19)
-; GFX10-NEXT:    v_cmp_u_f64_e64 s7, v[10:11], v[54:55]
-; GFX10-NEXT:    s_waitcnt vmcnt(18)
-; GFX10-NEXT:    v_min_f64 v[34:35], v[8:9], v[37:38]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s6, v[8:9], v[37:38]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s9, v[12:13], v[54:55]
+; GFX10-NEXT:    s_waitcnt vmcnt(17)
+; GFX10-NEXT:    v_cmp_u_f64_e64 s8, v[10:11], v[64:65]
 ; GFX10-NEXT:    s_waitcnt vmcnt(16)
-; GFX10-NEXT:    v_min_f64 v[8:9], v[0:1], v[64:65]
-; GFX10-NEXT:    v_min_f64 v[36:37], v[10:11], v[54:55]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s8, v[0:1], v[64:65]
-; GFX10-NEXT:    v_min_f64 v[38:39], v[12:13], v[52:53]
-; GFX10-NEXT:    v_min_f64 v[52:53], v[14:15], v[50:51]
+; GFX10-NEXT:    v_min_f64 v[48:49], v[8:9], v[37:38]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s7, v[8:9], v[37:38]
+; GFX10-NEXT:    v_min_f64 v[36:37], v[10:11], v[64:65]
+; GFX10-NEXT:    v_min_f64 v[38:39], v[12:13], v[54:55]
+; GFX10-NEXT:    v_min_f64 v[54:55], v[14:15], v[52:53]
 ; GFX10-NEXT:    s_waitcnt vmcnt(11)
-; GFX10-NEXT:    v_min_f64 v[54:55], v[20:21], v[70:71]
+; GFX10-NEXT:    v_min_f64 v[64:65], v[20:21], v[70:71]
 ; GFX10-NEXT:    v_cmp_u_f64_e64 s13, v[20:21], v[70:71]
 ; GFX10-NEXT:    s_waitcnt vmcnt(9)
 ; GFX10-NEXT:    v_cmp_u_f64_e64 s12, v[18:19], v[80:81]
 ; GFX10-NEXT:    s_waitcnt vmcnt(8)
-; GFX10-NEXT:    v_min_f64 v[50:51], v[16:17], v[48:49]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s11, v[16:17], v[48:49]
-; GFX10-NEXT:    v_min_f64 v[48:49], v[18:19], v[80:81]
-; GFX10-NEXT:    v_min_f64 v[64:65], v[22:23], v[68:69]
+; GFX10-NEXT:    v_min_f64 v[52:53], v[16:17], v[50:51]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s11, v[16:17], v[50:51]
+; GFX10-NEXT:    v_min_f64 v[50:51], v[18:19], v[80:81]
+; GFX10-NEXT:    v_min_f64 v[70:71], v[22:23], v[68:69]
 ; GFX10-NEXT:    v_cmp_u_f64_e64 s14, v[22:23], v[68:69]
-; GFX10-NEXT:    s_waitcnt vmcnt(7)
-; GFX10-NEXT:    v_min_f64 v[68:69], v[24:25], v[66:67]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s15, v[24:25], v[66:67]
-; GFX10-NEXT:    v_cndmask_b32_e64 v10, v36, 0, s7
-; GFX10-NEXT:    v_cndmask_b32_e64 v0, v8, 0, s8
-; GFX10-NEXT:    v_cndmask_b32_e64 v1, v9, 0x7ff80000, s8
-; GFX10-NEXT:    v_cndmask_b32_e64 v8, v34, 0, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v9, v35, 0x7ff80000, s6
-; GFX10-NEXT:    v_cndmask_b32_e64 v11, v37, 0x7ff80000, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v6, v34, 0, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v7, v35, 0x7ff80000, s6
+; GFX10-NEXT:    v_cndmask_b32_e64 v8, v48, 0, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v9, v49, 0x7ff80000, s7
+; GFX10-NEXT:    v_cndmask_b32_e64 v10, v36, 0, s8
+; GFX10-NEXT:    v_cndmask_b32_e64 v11, v37, 0x7ff80000, s8
 ; GFX10-NEXT:    v_cndmask_b32_e64 v12, v38, 0, s9
 ; GFX10-NEXT:    v_cndmask_b32_e64 v13, v39, 0x7ff80000, s9
-; GFX10-NEXT:    v_cndmask_b32_e64 v14, v52, 0, s10
-; GFX10-NEXT:    v_cndmask_b32_e64 v15, v53, 0x7ff80000, s10
-; GFX10-NEXT:    v_cndmask_b32_e64 v16, v50, 0, s11
-; GFX10-NEXT:    v_cndmask_b32_e64 v17, v51, 0x7ff80000, s11
-; GFX10-NEXT:    v_cndmask_b32_e64 v18, v48, 0, s12
-; GFX10-NEXT:    v_cndmask_b32_e64 v19, v49, 0x7ff80000, s12
-; GFX10-NEXT:    v_cndmask_b32_e64 v20, v54, 0, s13
-; GFX10-NEXT:    v_cndmask_b32_e64 v21, v55, 0x7ff80000, s13
-; GFX10-NEXT:    v_cndmask_b32_e64 v22, v64, 0, s14
-; GFX10-NEXT:    v_cndmask_b32_e64 v23, v65, 0x7ff80000, s14
-; GFX10-NEXT:    v_cndmask_b32_e64 v24, v68, 0, s15
-; GFX10-NEXT:    v_cndmask_b32_e64 v25, v69, 0x7ff80000, s15
+; GFX10-NEXT:    v_cndmask_b32_e64 v14, v54, 0, s10
+; GFX10-NEXT:    v_cndmask_b32_e64 v15, v55, 0x7ff80000, s10
+; GFX10-NEXT:    v_cndmask_b32_e64 v16, v52, 0, s11
+; GFX10-NEXT:    v_cndmask_b32_e64 v17, v53, 0x7ff80000, s11
+; GFX10-NEXT:    v_cndmask_b32_e64 v18, v50, 0, s12
+; GFX10-NEXT:    v_cndmask_b32_e64 v19, v51, 0x7ff80000, s12
+; GFX10-NEXT:    v_cndmask_b32_e64 v20, v64, 0, s13
+; GFX10-NEXT:    v_cndmask_b32_e64 v21, v65, 0x7ff80000, s13
+; GFX10-NEXT:    v_cndmask_b32_e64 v22, v70, 0, s14
+; GFX10-NEXT:    v_cndmask_b32_e64 v23, v71, 0x7ff80000, s14
+; GFX10-NEXT:    s_waitcnt vmcnt(6)
+; GFX10-NEXT:    v_min_f64 v[68:69], v[24:25], v[66:67]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s15, v[24:25], v[66:67]
 ; GFX10-NEXT:    s_waitcnt vmcnt(5)
-; GFX10-NEXT:    v_min_f64 v[70:71], v[28:29], v[2:3]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s17, v[28:29], v[2:3]
+; GFX10-NEXT:    v_min_f64 v[66:67], v[26:27], v[0:1]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s16, v[26:27], v[0:1]
 ; GFX10-NEXT:    s_waitcnt vmcnt(3)
-; GFX10-NEXT:    v_min_f64 v[66:67], v[26:27], v[4:5]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s16, v[26:27], v[4:5]
-; GFX10-NEXT:    v_cndmask_b32_e64 v2, v82, 0, vcc_lo
+; GFX10-NEXT:    v_min_f64 v[80:81], v[28:29], v[2:3]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s17, v[28:29], v[2:3]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_min_f64 v[80:81], v[30:31], v[6:7]
-; GFX10-NEXT:    v_cmp_u_f64_e64 s18, v[30:31], v[6:7]
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, v83, 0x7ff80000, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v4, v84, 0, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v5, v85, 0x7ff80000, s4
-; GFX10-NEXT:    v_cndmask_b32_e64 v6, v32, 0, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v7, v33, 0x7ff80000, s5
-; GFX10-NEXT:    v_cndmask_b32_e64 v28, v70, 0, s17
-; GFX10-NEXT:    v_cndmask_b32_e64 v29, v71, 0x7ff80000, s17
+; GFX10-NEXT:    v_min_f64 v[86:87], v[30:31], v[4:5]
+; GFX10-NEXT:    v_cmp_u_f64_e64 s18, v[30:31], v[4:5]
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v82, 0, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v83, 0x7ff80000, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v2, v84, 0, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, v85, 0x7ff80000, s4
+; GFX10-NEXT:    v_cndmask_b32_e64 v4, v32, 0, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v5, v33, 0x7ff80000, s5
+; GFX10-NEXT:    v_cndmask_b32_e64 v24, v68, 0, s15
+; GFX10-NEXT:    v_cndmask_b32_e64 v25, v69, 0x7ff80000, s15
 ; GFX10-NEXT:    v_cndmask_b32_e64 v26, v66, 0, s16
 ; GFX10-NEXT:    v_cndmask_b32_e64 v27, v67, 0x7ff80000, s16
-; GFX10-NEXT:    v_cndmask_b32_e64 v30, v80, 0, s18
-; GFX10-NEXT:    v_cndmask_b32_e64 v31, v81, 0x7ff80000, s18
+; GFX10-NEXT:    v_cndmask_b32_e64 v28, v80, 0, s17
+; GFX10-NEXT:    v_cndmask_b32_e64 v29, v81, 0x7ff80000, s17
+; GFX10-NEXT:    v_cndmask_b32_e64 v30, v86, 0, s18
+; GFX10-NEXT:    v_cndmask_b32_e64 v31, v87, 0x7ff80000, s18
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_minimum_v16f64:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.round.ll b/llvm/test/CodeGen/AMDGPU/llvm.round.ll
index c29362898f40e..42671f9dd6747 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.round.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.round.ll
@@ -651,10 +651,10 @@ define amdgpu_kernel void @round_v8f32(ptr addrspace(1) %out, <8 x float> %in) #
 ; GFX11-NEXT:    v_dual_sub_f32 v2, s11, v0 :: v_dual_sub_f32 v3, s10, v1
 ; GFX11-NEXT:    v_sub_f32_e32 v7, s9, v4
 ; GFX11-NEXT:    v_trunc_f32_e32 v9, s13
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-NEXT:    v_sub_f32_e32 v12, s15, v5
-; GFX11-NEXT:    v_cmp_ge_f32_e64 s2, |v2|, 0.5
 ; GFX11-NEXT:    v_sub_f32_e32 v11, s8, v8
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_4)
+; GFX11-NEXT:    v_cmp_ge_f32_e64 s2, |v2|, 0.5
+; GFX11-NEXT:    v_sub_f32_e32 v12, s15, v5
 ; GFX11-NEXT:    v_trunc_f32_e32 v6, s14
 ; GFX11-NEXT:    v_sub_f32_e32 v14, s13, v9
 ; GFX11-NEXT:    v_trunc_f32_e32 v10, s12
diff --git a/llvm/test/CodeGen/AMDGPU/load-constant-i1.ll b/llvm/test/CodeGen/AMDGPU/load-constant-i1.ll
index a9240eff8e691..67c2ee6403558 100644
--- a/llvm/test/CodeGen/AMDGPU/load-constant-i1.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-constant-i1.ll
@@ -1697,7 +1697,6 @@ define amdgpu_kernel void @constant_zextload_v16i1_to_v16i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_bfe_u32 s7, s2, 0x10009
 ; GFX12-NEXT:    s_bfe_u32 s8, s2, 0x1000d
 ; GFX12-NEXT:    s_and_b32 s9, s2, 1
-; GFX12-NEXT:    v_mov_b32_e32 v1, s8
 ; GFX12-NEXT:    s_bfe_u32 s10, s2, 0x1000a
 ; GFX12-NEXT:    s_bfe_u32 s2, s2, 0x1000c
 ; GFX12-NEXT:    s_bfe_u32 s11, s6, 0x10005
@@ -1709,6 +1708,7 @@ define amdgpu_kernel void @constant_zextload_v16i1_to_v16i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_bfe_u32 s17, s6, 0x10008
 ; GFX12-NEXT:    s_bfe_u32 s6, s6, 0x1000e
 ; GFX12-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v5, s7
+; GFX12-NEXT:    v_mov_b32_e32 v1, s8
 ; GFX12-NEXT:    v_dual_mov_b32 v15, s3 :: v_dual_mov_b32 v2, s6
 ; GFX12-NEXT:    v_dual_mov_b32 v3, s13 :: v_dual_mov_b32 v4, s17
 ; GFX12-NEXT:    v_dual_mov_b32 v6, s10 :: v_dual_mov_b32 v11, s5
@@ -2266,8 +2266,8 @@ define amdgpu_kernel void @constant_zextload_v32i1_to_v32i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s12 :: v_dual_mov_b32 v0, s27
-; GFX12-NEXT:    v_dual_mov_b32 v3, s11 :: v_dual_mov_b32 v2, s26
+; GFX12-NEXT:    v_dual_mov_b32 v0, s27 :: v_dual_mov_b32 v3, s11
+; GFX12-NEXT:    v_dual_mov_b32 v1, s12 :: v_dual_mov_b32 v2, s26
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s10 :: v_dual_mov_b32 v4, s25
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s9 :: v_dual_mov_b32 v6, s24
 ; GFX12-NEXT:    v_dual_mov_b32 v13, s8 :: v_dual_mov_b32 v12, s23
@@ -2668,8 +2668,8 @@ define amdgpu_kernel void @constant_sextload_v32i1_to_v32i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s21 :: v_dual_mov_b32 v0, s22
-; GFX12-NEXT:    v_dual_mov_b32 v3, s19 :: v_dual_mov_b32 v2, s20
+; GFX12-NEXT:    v_dual_mov_b32 v0, s22 :: v_dual_mov_b32 v3, s19
+; GFX12-NEXT:    v_dual_mov_b32 v1, s21 :: v_dual_mov_b32 v2, s20
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s17 :: v_dual_mov_b32 v4, s18
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s15 :: v_dual_mov_b32 v6, s16
 ; GFX12-NEXT:    v_dual_mov_b32 v13, s13 :: v_dual_mov_b32 v12, s14
@@ -3314,8 +3314,8 @@ define amdgpu_kernel void @constant_zextload_v64i1_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[0:1] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[0:1] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s26 :: v_dual_mov_b32 v0, s3
-; GFX12-NEXT:    v_dual_mov_b32 v3, s25 :: v_dual_mov_b32 v2, s57
+; GFX12-NEXT:    v_dual_mov_b32 v0, s3 :: v_dual_mov_b32 v3, s25
+; GFX12-NEXT:    v_dual_mov_b32 v1, s26 :: v_dual_mov_b32 v2, s57
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s24 :: v_dual_mov_b32 v4, s56
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s23 :: v_dual_mov_b32 v6, s55
 ; GFX12-NEXT:    v_mov_b32_e32 v9, s22
@@ -3367,8 +3367,8 @@ define amdgpu_kernel void @constant_zextload_v64i1_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[0:1] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v0, s47
-; GFX12-NEXT:    v_dual_mov_b32 v3, s14 :: v_dual_mov_b32 v2, s45
+; GFX12-NEXT:    v_dual_mov_b32 v0, s47 :: v_dual_mov_b32 v3, s14
+; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v2, s45
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s13 :: v_dual_mov_b32 v4, s44
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s12 :: v_dual_mov_b32 v6, s43
 ; GFX12-NEXT:    v_dual_mov_b32 v9, s11 :: v_dual_mov_b32 v8, s42
@@ -4075,8 +4075,8 @@ define amdgpu_kernel void @constant_sextload_v64i1_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[0:1] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[0:1] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s50 :: v_dual_mov_b32 v0, s3
-; GFX12-NEXT:    v_dual_mov_b32 v3, s48 :: v_dual_mov_b32 v2, s49
+; GFX12-NEXT:    v_dual_mov_b32 v0, s3 :: v_dual_mov_b32 v3, s48
+; GFX12-NEXT:    v_dual_mov_b32 v1, s50 :: v_dual_mov_b32 v2, s49
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s46 :: v_dual_mov_b32 v4, s47
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s44 :: v_dual_mov_b32 v6, s45
 ; GFX12-NEXT:    v_mov_b32_e32 v9, s42
@@ -4128,8 +4128,8 @@ define amdgpu_kernel void @constant_sextload_v64i1_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[0:1] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s26 :: v_dual_mov_b32 v0, s27
-; GFX12-NEXT:    v_dual_mov_b32 v3, s24 :: v_dual_mov_b32 v2, s25
+; GFX12-NEXT:    v_dual_mov_b32 v0, s27 :: v_dual_mov_b32 v3, s24
+; GFX12-NEXT:    v_dual_mov_b32 v1, s26 :: v_dual_mov_b32 v2, s25
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s22 :: v_dual_mov_b32 v4, s23
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s20 :: v_dual_mov_b32 v6, s21
 ; GFX12-NEXT:    v_dual_mov_b32 v9, s18 :: v_dual_mov_b32 v8, s19
@@ -5653,14 +5653,13 @@ define amdgpu_kernel void @constant_zextload_v16i1_to_v16i64(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    global_load_u16 v0, v1, s[2:3]
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
-; GFX12-NEXT:    v_and_b32_e32 v4, 0xffff, v0
 ; GFX12-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX12-NEXT:    v_mov_b32_e32 v7, v1
+; GFX12-NEXT:    v_dual_mov_b32 v7, v1 :: v_dual_and_b32 v4, 0xffff, v0
 ; GFX12-NEXT:    v_mov_b32_e32 v11, v1
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX12-NEXT:    v_bfe_u32 v2, v4, 11, 1
 ; GFX12-NEXT:    s_bfe_u32 s3, s2, 0x1000a
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_3)
 ; GFX12-NEXT:    v_dual_mov_b32 v3, v1 :: v_dual_mov_b32 v0, s3
+; GFX12-NEXT:    v_bfe_u32 v2, v4, 11, 1
 ; GFX12-NEXT:    s_bfe_u32 s3, s2, 0x1000d
 ; GFX12-NEXT:    s_bfe_u32 s4, s2, 0x1000c
 ; GFX12-NEXT:    v_mov_b32_e32 v5, v1
@@ -7229,8 +7228,8 @@ define amdgpu_kernel void @constant_sextload_v32i1_to_v32i64(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[0:1] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[0:1] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s27 :: v_dual_mov_b32 v0, s26
-; GFX12-NEXT:    v_dual_mov_b32 v3, s51 :: v_dual_mov_b32 v2, s50
+; GFX12-NEXT:    v_dual_mov_b32 v0, s26 :: v_dual_mov_b32 v3, s51
+; GFX12-NEXT:    v_dual_mov_b32 v1, s27 :: v_dual_mov_b32 v2, s50
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s53
 ; GFX12-NEXT:    s_lshr_b32 s30, s2, 12
 ; GFX12-NEXT:    s_lshr_b32 s28, s2, 13
@@ -7273,8 +7272,8 @@ define amdgpu_kernel void @constant_sextload_v32i1_to_v32i64(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[0:1] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s25 :: v_dual_mov_b32 v0, s24
-; GFX12-NEXT:    v_dual_mov_b32 v3, s23 :: v_dual_mov_b32 v2, s22
+; GFX12-NEXT:    v_dual_mov_b32 v0, s24 :: v_dual_mov_b32 v3, s23
+; GFX12-NEXT:    v_dual_mov_b32 v1, s25 :: v_dual_mov_b32 v2, s22
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX12-NEXT:    s_lshr_b32 s68, s2, 1
 ; GFX12-NEXT:    s_bfe_i64 s[10:11], s[10:11], 0x10000
diff --git a/llvm/test/CodeGen/AMDGPU/load-constant-i16.ll b/llvm/test/CodeGen/AMDGPU/load-constant-i16.ll
index 817c5def5614f..31672fe32c44f 100644
--- a/llvm/test/CodeGen/AMDGPU/load-constant-i16.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-constant-i16.ll
@@ -661,10 +661,10 @@ define amdgpu_kernel void @constant_load_v16i16_align2(ptr addrspace(4) %ptr0) #
 ; GCN-NOHSA-VI-NEXT:    flat_load_ushort v17, v[2:3]
 ; GCN-NOHSA-VI-NEXT:    flat_load_ushort v18, v[4:5]
 ; GCN-NOHSA-VI-NEXT:    flat_load_ushort v19, v[6:7]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v20, v[8:9]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v21, v[10:11]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v12, v[12:13]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v13, v[14:15]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v8, v[8:9]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v9, v[10:11]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v10, v[12:13]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v11, v[14:15]
 ; GCN-NOHSA-VI-NEXT:    s_addc_u32 s3, s1, 0
 ; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v0, s2
 ; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v1, s3
@@ -681,27 +681,27 @@ define amdgpu_kernel void @constant_load_v16i16_align2(ptr addrspace(4) %ptr0) #
 ; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v7, s3
 ; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v6, s2
 ; GCN-NOHSA-VI-NEXT:    s_add_u32 s2, s0, 18
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v12, v[0:1]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v13, v[2:3]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v14, v[4:5]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v15, v[6:7]
 ; GCN-NOHSA-VI-NEXT:    s_addc_u32 s3, s1, 0
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v9, s3
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v8, s2
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v0, s2
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v1, s3
 ; GCN-NOHSA-VI-NEXT:    s_add_u32 s2, s0, 16
 ; GCN-NOHSA-VI-NEXT:    s_addc_u32 s3, s1, 0
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v11, s3
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v10, s2
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v2, s2
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v3, s3
 ; GCN-NOHSA-VI-NEXT:    s_add_u32 s2, s0, 2
 ; GCN-NOHSA-VI-NEXT:    s_addc_u32 s3, s1, 0
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v14, v[0:1]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v15, v[2:3]
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v0, s2
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v4, v[4:5]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v5, v[6:7]
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v3, s1
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v1, s3
-; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v8, v[8:9]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v9, v[10:11]
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v5, s3
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v7, s1
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v4, s2
+; GCN-NOHSA-VI-NEXT:    v_mov_b32_e32 v6, s0
 ; GCN-NOHSA-VI-NEXT:    flat_load_ushort v0, v[0:1]
-; GCN-NOHSA-VI-NEXT:    flat_load_ushort v10, v[2:3]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v20, v[2:3]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v21, v[4:5]
+; GCN-NOHSA-VI-NEXT:    flat_load_ushort v22, v[6:7]
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(14)
 ; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v16
 ; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v3, v17, v1
@@ -710,29 +710,29 @@ define amdgpu_kernel void @constant_load_v16i16_align2(ptr addrspace(4) %ptr0) #
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(12)
 ; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v2, v19, v1
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(11)
-; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v20
+; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v8
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(10)
-; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v1, v21, v1
+; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v1, v9, v1
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(9)
-; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v6, 16, v12
+; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v4, 16, v10
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(8)
-; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v7, v13, v6
+; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v7, v11, v4
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(7)
-; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v6, 16, v14
+; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v4, 16, v12
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(6)
-; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v6, v15, v6
+; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v6, v13, v4
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(5)
-; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v4, 16, v14
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(4)
-; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v5, v5, v4
+; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v5, v15, v4
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(3)
-; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v4, 16, v8
+; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(2)
-; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v4, v9, v4
+; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v4, v20, v0
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
+; GCN-NOHSA-VI-NEXT:    v_lshlrev_b32_e32 v0, 16, v21
 ; GCN-NOHSA-VI-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v0, v10, v0
+; GCN-NOHSA-VI-NEXT:    v_or_b32_e32 v0, v22, v0
 ; GCN-NOHSA-VI-NEXT:    flat_store_dwordx4 v[0:1], v[4:7]
 ; GCN-NOHSA-VI-NEXT:    flat_store_dwordx4 v[0:1], v[0:3]
 ; GCN-NOHSA-VI-NEXT:    s_endpgm
@@ -760,34 +760,34 @@ define amdgpu_kernel void @constant_load_v16i16_align2(ptr addrspace(4) %ptr0) #
 ; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v8, 0
 ; GFX12-TRUE16-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-TRUE16-NEXT:    s_clause 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v3, v8, s[0:1] offset:28
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v2, v8, s[0:1] offset:24
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v1, v8, s[0:1] offset:20
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v0, v8, s[0:1] offset:16
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v7, v8, s[0:1] offset:12
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v6, v8, s[0:1] offset:8
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v5, v8, s[0:1] offset:4
-; GFX12-TRUE16-NEXT:    global_load_d16_b16 v4, v8, s[0:1]
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v3, v8, s[0:1] offset:12
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v7, v8, s[0:1] offset:28
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v6, v8, s[0:1] offset:24
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v5, v8, s[0:1] offset:20
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v4, v8, s[0:1] offset:16
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v2, v8, s[0:1] offset:8
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v1, v8, s[0:1] offset:4
+; GFX12-TRUE16-NEXT:    global_load_d16_b16 v0, v8, s[0:1]
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v3, v8, s[0:1] offset:30
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v3, v8, s[0:1] offset:14
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v2, v8, s[0:1] offset:26
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v7, v8, s[0:1] offset:30
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v1, v8, s[0:1] offset:22
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v6, v8, s[0:1] offset:26
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v0, v8, s[0:1] offset:18
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v5, v8, s[0:1] offset:22
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v7, v8, s[0:1] offset:14
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v4, v8, s[0:1] offset:18
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v6, v8, s[0:1] offset:10
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v2, v8, s[0:1] offset:10
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v5, v8, s[0:1] offset:6
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v1, v8, s[0:1] offset:6
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v4, v8, s[0:1] offset:2
-; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x4
-; GFX12-TRUE16-NEXT:    global_store_b128 v[0:1], v[0:3], off
+; GFX12-TRUE16-NEXT:    global_load_d16_hi_b16 v0, v8, s[0:1] offset:2
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
+; GFX12-TRUE16-NEXT:    s_clause 0x1
 ; GFX12-TRUE16-NEXT:    global_store_b128 v[0:1], v[4:7], off
+; GFX12-TRUE16-NEXT:    global_store_b128 v[0:1], v[0:3], off
 ; GFX12-TRUE16-NEXT:    s_endpgm
 ;
 ; GFX12-FAKE16-LABEL: constant_load_v16i16_align2:
@@ -796,34 +796,34 @@ define amdgpu_kernel void @constant_load_v16i16_align2(ptr addrspace(4) %ptr0) #
 ; GFX12-FAKE16-NEXT:    v_mov_b32_e32 v8, 0
 ; GFX12-FAKE16-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-FAKE16-NEXT:    s_clause 0x7
-; GFX12-FAKE16-NEXT:    global_load_u16 v3, v8, s[0:1] offset:28
-; GFX12-FAKE16-NEXT:    global_load_u16 v2, v8, s[0:1] offset:24
-; GFX12-FAKE16-NEXT:    global_load_u16 v1, v8, s[0:1] offset:20
-; GFX12-FAKE16-NEXT:    global_load_u16 v0, v8, s[0:1] offset:16
-; GFX12-FAKE16-NEXT:    global_load_u16 v7, v8, s[0:1] offset:12
-; GFX12-FAKE16-NEXT:    global_load_u16 v6, v8, s[0:1] offset:8
-; GFX12-FAKE16-NEXT:    global_load_u16 v5, v8, s[0:1] offset:4
-; GFX12-FAKE16-NEXT:    global_load_u16 v4, v8, s[0:1]
+; GFX12-FAKE16-NEXT:    global_load_u16 v3, v8, s[0:1] offset:12
+; GFX12-FAKE16-NEXT:    global_load_u16 v7, v8, s[0:1] offset:28
+; GFX12-FAKE16-NEXT:    global_load_u16 v6, v8, s[0:1] offset:24
+; GFX12-FAKE16-NEXT:    global_load_u16 v5, v8, s[0:1] offset:20
+; GFX12-FAKE16-NEXT:    global_load_u16 v4, v8, s[0:1] offset:16
+; GFX12-FAKE16-NEXT:    global_load_u16 v2, v8, s[0:1] offset:8
+; GFX12-FAKE16-NEXT:    global_load_u16 v1, v8, s[0:1] offset:4
+; GFX12-FAKE16-NEXT:    global_load_u16 v0, v8, s[0:1]
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v3, v8, s[0:1] offset:30
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v3, v8, s[0:1] offset:14
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v2, v8, s[0:1] offset:26
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v7, v8, s[0:1] offset:30
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v1, v8, s[0:1] offset:22
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v6, v8, s[0:1] offset:26
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v0, v8, s[0:1] offset:18
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v5, v8, s[0:1] offset:22
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v7, v8, s[0:1] offset:14
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v4, v8, s[0:1] offset:18
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v6, v8, s[0:1] offset:10
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v2, v8, s[0:1] offset:10
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v5, v8, s[0:1] offset:6
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v1, v8, s[0:1] offset:6
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x7
-; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v4, v8, s[0:1] offset:2
-; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x4
-; GFX12-FAKE16-NEXT:    global_store_b128 v[0:1], v[0:3], off
+; GFX12-FAKE16-NEXT:    global_load_d16_hi_b16 v0, v8, s[0:1] offset:2
 ; GFX12-FAKE16-NEXT:    s_wait_loadcnt 0x0
+; GFX12-FAKE16-NEXT:    s_clause 0x1
 ; GFX12-FAKE16-NEXT:    global_store_b128 v[0:1], v[4:7], off
+; GFX12-FAKE16-NEXT:    global_store_b128 v[0:1], v[0:3], off
 ; GFX12-FAKE16-NEXT:    s_endpgm
 entry:
   %ld =  load <16 x i16>, ptr addrspace(4) %ptr0, align 2
@@ -3051,8 +3051,8 @@ define amdgpu_kernel void @constant_zextload_v32i16_to_v32i32(ptr addrspace(1) %
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[16:17] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[16:17] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s27 :: v_dual_mov_b32 v0, s8
-; GFX12-NEXT:    v_dual_mov_b32 v3, s26 :: v_dual_mov_b32 v2, s9
+; GFX12-NEXT:    v_dual_mov_b32 v0, s8 :: v_dual_mov_b32 v3, s26
+; GFX12-NEXT:    v_dual_mov_b32 v1, s27 :: v_dual_mov_b32 v2, s9
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s25
 ; GFX12-NEXT:    s_lshr_b32 s20, s3, 16
 ; GFX12-NEXT:    s_and_b32 s3, s3, 0xffff
@@ -3545,8 +3545,8 @@ define amdgpu_kernel void @constant_sextload_v32i16_to_v32i32(ptr addrspace(1) %
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[16:17] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[16:17] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s27 :: v_dual_mov_b32 v0, s8
-; GFX12-NEXT:    v_dual_mov_b32 v3, s26 :: v_dual_mov_b32 v2, s9
+; GFX12-NEXT:    v_dual_mov_b32 v0, s8 :: v_dual_mov_b32 v3, s26
+; GFX12-NEXT:    v_dual_mov_b32 v1, s27 :: v_dual_mov_b32 v2, s9
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s25
 ; GFX12-NEXT:    s_ashr_i32 s20, s3, 16
 ; GFX12-NEXT:    s_ashr_i32 s21, s2, 16
@@ -4415,8 +4415,8 @@ define amdgpu_kernel void @constant_zextload_v64i16_to_v64i32(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[36:37] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[36:37] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[36:37] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s58 :: v_dual_mov_b32 v0, s6
-; GFX12-NEXT:    v_dual_mov_b32 v3, s57 :: v_dual_mov_b32 v2, s7
+; GFX12-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v3, s57
+; GFX12-NEXT:    v_dual_mov_b32 v1, s58 :: v_dual_mov_b32 v2, s7
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s56
 ; GFX12-NEXT:    s_lshr_b32 s51, s1, 16
 ; GFX12-NEXT:    s_lshr_b32 s52, s0, 16
@@ -4458,8 +4458,8 @@ define amdgpu_kernel void @constant_zextload_v64i16_to_v64i32(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[36:37] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[36:37] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[36:37] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v0, s26
-; GFX12-NEXT:    v_dual_mov_b32 v3, s45 :: v_dual_mov_b32 v2, s27
+; GFX12-NEXT:    v_dual_mov_b32 v0, s26 :: v_dual_mov_b32 v3, s45
+; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v2, s27
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s44
 ; GFX12-NEXT:    s_lshr_b32 s39, s21, 16
 ; GFX12-NEXT:    s_lshr_b32 s40, s20, 16
@@ -5353,8 +5353,8 @@ define amdgpu_kernel void @constant_sextload_v64i16_to_v64i32(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[36:37] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[36:37] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[36:37] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s58 :: v_dual_mov_b32 v0, s6
-; GFX12-NEXT:    v_dual_mov_b32 v3, s57 :: v_dual_mov_b32 v2, s7
+; GFX12-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v3, s57
+; GFX12-NEXT:    v_dual_mov_b32 v1, s58 :: v_dual_mov_b32 v2, s7
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s56
 ; GFX12-NEXT:    s_ashr_i32 s51, s1, 16
 ; GFX12-NEXT:    s_ashr_i32 s52, s0, 16
@@ -5397,8 +5397,8 @@ define amdgpu_kernel void @constant_sextload_v64i16_to_v64i32(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[36:37] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[36:37] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[36:37] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v0, s26
-; GFX12-NEXT:    v_dual_mov_b32 v3, s45 :: v_dual_mov_b32 v2, s27
+; GFX12-NEXT:    v_dual_mov_b32 v0, s26 :: v_dual_mov_b32 v3, s45
+; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v2, s27
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s44
 ; GFX12-NEXT:    s_ashr_i32 s39, s21, 16
 ; GFX12-NEXT:    s_ashr_i32 s40, s20, 16
@@ -7610,8 +7610,8 @@ define amdgpu_kernel void @constant_sextload_v16i16_to_v16i64(ptr addrspace(1) %
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[0:1] offset:80
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[0:1] offset:64
-; GFX12-NEXT:    v_dual_mov_b32 v1, s17 :: v_dual_mov_b32 v0, s16
-; GFX12-NEXT:    v_dual_mov_b32 v3, s7 :: v_dual_mov_b32 v2, s6
+; GFX12-NEXT:    v_dual_mov_b32 v0, s16 :: v_dual_mov_b32 v3, s7
+; GFX12-NEXT:    v_dual_mov_b32 v1, s17 :: v_dual_mov_b32 v2, s6
 ; GFX12-NEXT:    v_dual_mov_b32 v9, s13 :: v_dual_mov_b32 v8, s12
 ; GFX12-NEXT:    v_dual_mov_b32 v11, s15 :: v_dual_mov_b32 v10, s14
 ; GFX12-NEXT:    v_dual_mov_b32 v21, s3 :: v_dual_mov_b32 v20, s2
@@ -9128,9 +9128,9 @@ define amdgpu_kernel void @constant_sextload_v32i16_to_v32i64(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[16:17] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[16:17] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[16:17] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s51 :: v_dual_mov_b32 v0, s50
 ; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    v_dual_mov_b32 v3, s13 :: v_dual_mov_b32 v2, s12
+; GFX12-NEXT:    v_dual_mov_b32 v0, s50 :: v_dual_mov_b32 v3, s13
+; GFX12-NEXT:    v_dual_mov_b32 v1, s51 :: v_dual_mov_b32 v2, s12
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s57 :: v_dual_mov_b32 v4, s56
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s49 :: v_dual_mov_b32 v6, s48
 ; GFX12-NEXT:    v_dual_mov_b32 v9, s45 :: v_dual_mov_b32 v8, s44
@@ -9148,8 +9148,8 @@ define amdgpu_kernel void @constant_sextload_v32i16_to_v32i64(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[16:17] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[16:17] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[16:17] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s31 :: v_dual_mov_b32 v0, s30
-; GFX12-NEXT:    v_dual_mov_b32 v3, s29 :: v_dual_mov_b32 v2, s28
+; GFX12-NEXT:    v_dual_mov_b32 v0, s30 :: v_dual_mov_b32 v3, s29
+; GFX12-NEXT:    v_dual_mov_b32 v1, s31 :: v_dual_mov_b32 v2, s28
 ; GFX12-NEXT:    v_dual_mov_b32 v5, s3 :: v_dual_mov_b32 v4, s2
 ; GFX12-NEXT:    v_dual_mov_b32 v7, s27 :: v_dual_mov_b32 v6, s26
 ; GFX12-NEXT:    v_dual_mov_b32 v9, s25 :: v_dual_mov_b32 v8, s24
diff --git a/llvm/test/CodeGen/AMDGPU/load-constant-i32.ll b/llvm/test/CodeGen/AMDGPU/load-constant-i32.ll
index 68a6a148819e8..d86402a6fb62e 100644
--- a/llvm/test/CodeGen/AMDGPU/load-constant-i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-constant-i32.ll
@@ -3197,8 +3197,8 @@ define amdgpu_kernel void @constant_sextload_v16i32_to_v16i64(ptr addrspace(1) %
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v28, v[0:3], s[16:17] offset:112
 ; GFX12-NEXT:    global_store_b128 v28, v[4:7], s[16:17] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s19 :: v_dual_mov_b32 v0, s0
-; GFX12-NEXT:    v_dual_mov_b32 v3, s18 :: v_dual_mov_b32 v2, s1
+; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v3, s18
+; GFX12-NEXT:    v_dual_mov_b32 v1, s19 :: v_dual_mov_b32 v2, s1
 ; GFX12-NEXT:    s_clause 0x5
 ; GFX12-NEXT:    global_store_b128 v28, v[8:11], s[16:17] offset:80
 ; GFX12-NEXT:    global_store_b128 v28, v[12:15], s[16:17] offset:64
@@ -4401,9 +4401,9 @@ define amdgpu_kernel void @constant_sextload_v32i32_to_v32i64(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[36:37] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[36:37] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[36:37] offset:192
+; GFX12-NEXT:    v_dual_mov_b32 v0, s22 :: v_dual_mov_b32 v3, s57
 ; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    v_dual_mov_b32 v1, s24 :: v_dual_mov_b32 v0, s22
-; GFX12-NEXT:    v_dual_mov_b32 v3, s57 :: v_dual_mov_b32 v2, s23
+; GFX12-NEXT:    v_dual_mov_b32 v1, s24 :: v_dual_mov_b32 v2, s23
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s56
 ; GFX12-NEXT:    s_ashr_i32 s51, s17, 31
 ; GFX12-NEXT:    s_ashr_i32 s52, s16, 31
@@ -4433,8 +4433,8 @@ define amdgpu_kernel void @constant_sextload_v32i32_to_v32i64(ptr addrspace(1) %
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[36:37] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[36:37] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[36:37] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v0, s10
-; GFX12-NEXT:    v_dual_mov_b32 v3, s45 :: v_dual_mov_b32 v2, s11
+; GFX12-NEXT:    v_dual_mov_b32 v0, s10 :: v_dual_mov_b32 v3, s45
+; GFX12-NEXT:    v_dual_mov_b32 v1, s46 :: v_dual_mov_b32 v2, s11
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s44
 ; GFX12-NEXT:    s_ashr_i32 s39, s5, 31
 ; GFX12-NEXT:    s_ashr_i32 s40, s4, 31
diff --git a/llvm/test/CodeGen/AMDGPU/load-constant-i8.ll b/llvm/test/CodeGen/AMDGPU/load-constant-i8.ll
index 3b0f8523e1b52..88beb0683f8e0 100644
--- a/llvm/test/CodeGen/AMDGPU/load-constant-i8.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-constant-i8.ll
@@ -2827,8 +2827,8 @@ define amdgpu_kernel void @constant_zextload_v32i8_to_v32i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s19 :: v_dual_mov_b32 v0, s30
-; GFX12-NEXT:    v_dual_mov_b32 v3, s18 :: v_dual_mov_b32 v2, s8
+; GFX12-NEXT:    v_dual_mov_b32 v0, s30 :: v_dual_mov_b32 v3, s18
+; GFX12-NEXT:    v_dual_mov_b32 v1, s19 :: v_dual_mov_b32 v2, s8
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s17
 ; GFX12-NEXT:    s_lshr_b32 s12, s5, 24
 ; GFX12-NEXT:    s_bfe_u32 s13, s5, 0x80008
@@ -3335,8 +3335,8 @@ define amdgpu_kernel void @constant_sextload_v32i8_to_v32i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s24 :: v_dual_mov_b32 v0, s8
-; GFX12-NEXT:    v_dual_mov_b32 v3, s22 :: v_dual_mov_b32 v2, s23
+; GFX12-NEXT:    v_dual_mov_b32 v0, s8 :: v_dual_mov_b32 v3, s22
+; GFX12-NEXT:    v_dual_mov_b32 v1, s24 :: v_dual_mov_b32 v2, s23
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s21
 ; GFX12-NEXT:    s_ashr_i32 s13, s5, 24
 ; GFX12-NEXT:    s_bfe_i32 s14, s5, 0x80010
@@ -4203,8 +4203,8 @@ define amdgpu_kernel void @constant_zextload_v64i8_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[16:17] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[16:17] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[16:17] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s42 :: v_dual_mov_b32 v0, s62
-; GFX12-NEXT:    v_dual_mov_b32 v3, s41 :: v_dual_mov_b32 v2, s11
+; GFX12-NEXT:    v_dual_mov_b32 v0, s62 :: v_dual_mov_b32 v3, s41
+; GFX12-NEXT:    v_dual_mov_b32 v1, s42 :: v_dual_mov_b32 v2, s11
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s40
 ; GFX12-NEXT:    s_lshr_b32 s35, s8, 24
 ; GFX12-NEXT:    s_bfe_u32 s36, s8, 0x80008
@@ -4247,8 +4247,8 @@ define amdgpu_kernel void @constant_zextload_v64i8_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[16:17] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[16:17] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[16:17] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s29 :: v_dual_mov_b32 v0, s56
-; GFX12-NEXT:    v_dual_mov_b32 v3, s28 :: v_dual_mov_b32 v2, s5
+; GFX12-NEXT:    v_dual_mov_b32 v0, s56 :: v_dual_mov_b32 v3, s28
+; GFX12-NEXT:    v_dual_mov_b32 v1, s29 :: v_dual_mov_b32 v2, s5
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s27
 ; GFX12-NEXT:    s_lshr_b32 s22, s2, 24
 ; GFX12-NEXT:    s_bfe_u32 s23, s2, 0x80008
@@ -5163,8 +5163,8 @@ define amdgpu_kernel void @constant_sextload_v64i8_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[4:7], s[16:17] offset:224
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[16:17] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[16:17] offset:192
-; GFX12-NEXT:    v_dual_mov_b32 v1, s54 :: v_dual_mov_b32 v0, s11
-; GFX12-NEXT:    v_dual_mov_b32 v3, s52 :: v_dual_mov_b32 v2, s53
+; GFX12-NEXT:    v_dual_mov_b32 v0, s11 :: v_dual_mov_b32 v3, s52
+; GFX12-NEXT:    v_dual_mov_b32 v1, s54 :: v_dual_mov_b32 v2, s53
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s51
 ; GFX12-NEXT:    s_ashr_i32 s43, s8, 24
 ; GFX12-NEXT:    s_bfe_i32 s44, s8, 0x80010
@@ -5207,8 +5207,8 @@ define amdgpu_kernel void @constant_sextload_v64i8_to_v64i32(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[16:17] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[16:17] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[16:17] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s36 :: v_dual_mov_b32 v0, s5
-; GFX12-NEXT:    v_dual_mov_b32 v3, s34 :: v_dual_mov_b32 v2, s35
+; GFX12-NEXT:    v_dual_mov_b32 v0, s5 :: v_dual_mov_b32 v3, s34
+; GFX12-NEXT:    v_dual_mov_b32 v1, s36 :: v_dual_mov_b32 v2, s35
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s33
 ; GFX12-NEXT:    s_ashr_i32 s24, s2, 24
 ; GFX12-NEXT:    s_bfe_i32 s25, s2, 0x80010
@@ -7467,8 +7467,8 @@ define amdgpu_kernel void @constant_sextload_v16i8_to_v16i64(ptr addrspace(1) %o
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v24, v[0:3], s[0:1] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[0:1] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s23 :: v_dual_mov_b32 v0, s22
-; GFX12-NEXT:    v_dual_mov_b32 v3, s21 :: v_dual_mov_b32 v2, s20
+; GFX12-NEXT:    v_dual_mov_b32 v0, s22 :: v_dual_mov_b32 v3, s21
+; GFX12-NEXT:    v_dual_mov_b32 v1, s23 :: v_dual_mov_b32 v2, s20
 ; GFX12-NEXT:    v_dual_mov_b32 v9, s25 :: v_dual_mov_b32 v8, s24
 ; GFX12-NEXT:    v_dual_mov_b32 v11, s27 :: v_dual_mov_b32 v10, s26
 ; GFX12-NEXT:    v_dual_mov_b32 v21, s31 :: v_dual_mov_b32 v20, s30
@@ -9002,8 +9002,8 @@ define amdgpu_kernel void @constant_sextload_v32i8_to_v32i64(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[8:11], s[8:9] offset:208
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[8:9] offset:192
 ; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    v_dual_mov_b32 v1, s37 :: v_dual_mov_b32 v0, s36
-; GFX12-NEXT:    v_dual_mov_b32 v3, s71 :: v_dual_mov_b32 v2, s70
+; GFX12-NEXT:    v_dual_mov_b32 v0, s36 :: v_dual_mov_b32 v3, s71
+; GFX12-NEXT:    v_dual_mov_b32 v1, s37 :: v_dual_mov_b32 v2, s70
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s53
 ; GFX12-NEXT:    s_lshr_b32 s34, s3, 8
 ; GFX12-NEXT:    s_mov_b32 s30, s3
@@ -9044,8 +9044,8 @@ define amdgpu_kernel void @constant_sextload_v32i8_to_v32i64(ptr addrspace(1) %o
 ; GFX12-NEXT:    global_store_b128 v24, v[12:15], s[8:9] offset:128
 ; GFX12-NEXT:    global_store_b128 v24, v[16:19], s[8:9] offset:112
 ; GFX12-NEXT:    global_store_b128 v24, v[20:23], s[8:9] offset:96
-; GFX12-NEXT:    v_dual_mov_b32 v1, s25 :: v_dual_mov_b32 v0, s24
-; GFX12-NEXT:    v_dual_mov_b32 v3, s23 :: v_dual_mov_b32 v2, s22
+; GFX12-NEXT:    v_dual_mov_b32 v0, s24 :: v_dual_mov_b32 v3, s23
+; GFX12-NEXT:    v_dual_mov_b32 v1, s25 :: v_dual_mov_b32 v2, s22
 ; GFX12-NEXT:    v_mov_b32_e32 v5, s17
 ; GFX12-NEXT:    s_lshr_b32 s68, s0, 8
 ; GFX12-NEXT:    s_bfe_i64 s[6:7], s[62:63], 0x80000
diff --git a/llvm/test/CodeGen/AMDGPU/load-global-i16.ll b/llvm/test/CodeGen/AMDGPU/load-global-i16.ll
index 9054e509cde8e..15ea4b7b52eca 100644
--- a/llvm/test/CodeGen/AMDGPU/load-global-i16.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-global-i16.ll
@@ -3658,8 +3658,8 @@ define amdgpu_kernel void @global_zextload_v64i16_to_v64i32(ptr addrspace(1) %ou
 ; GCN-HSA-NEXT:    v_mov_b32_e32 v17, s9
 ; GCN-HSA-NEXT:    s_add_u32 s10, s2, 64
 ; GCN-HSA-NEXT:    v_mov_b32_e32 v16, s8
-; GCN-HSA-NEXT:    flat_load_dwordx4 v[20:23], v[16:17]
 ; GCN-HSA-NEXT:    s_addc_u32 s11, s3, 0
+; GCN-HSA-NEXT:    flat_load_dwordx4 v[20:23], v[16:17]
 ; GCN-HSA-NEXT:    v_mov_b32_e32 v0, s10
 ; GCN-HSA-NEXT:    v_mov_b32_e32 v1, s11
 ; GCN-HSA-NEXT:    s_add_u32 s10, s2, 0x50
diff --git a/llvm/test/CodeGen/AMDGPU/load-global-i32.ll b/llvm/test/CodeGen/AMDGPU/load-global-i32.ll
index e8c862a3cb93c..e55fb2cac0985 100644
--- a/llvm/test/CodeGen/AMDGPU/load-global-i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-global-i32.ll
@@ -3661,66 +3661,57 @@ define amdgpu_kernel void @global_sextload_v32i32_to_v32i64(ptr addrspace(1) %ou
 ; GCN-GFX900-HSA-NEXT:    s_mov_b64 s[22:23], s[2:3]
 ; GCN-GFX900-HSA-NEXT:    s_mov_b64 s[20:21], s[0:1]
 ; GCN-GFX900-HSA-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v8, 0
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v12, 0
 ; GCN-GFX900-HSA-NEXT:    s_add_u32 s20, s20, s17
 ; GCN-GFX900-HSA-NEXT:    s_addc_u32 s21, s21, 0
 ; GCN-GFX900-HSA-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[0:3], v8, s[2:3] offset:96
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[4:7], v8, s[2:3] offset:112
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[9:12], v8, s[2:3] offset:80
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[13:16], v8, s[2:3] offset:64
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[17:20], v8, s[2:3] offset:48
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[21:24], v8, s[2:3] offset:32
-; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(5)
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v28, 31, v3
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v26, 31, v2
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v25, v2
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v27, v3
-; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(4)
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[4:7], v12, s[2:3] offset:96
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[8:11], v12, s[2:3] offset:112
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[25:28], v12, s[2:3] offset:80
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[13:16], v12, s[2:3] offset:64
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[17:20], v12, s[2:3] offset:48
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[21:24], v12, s[2:3] offset:32
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[0:3], v12, s[2:3] offset:16
+; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(6)
 ; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v32, 31, v7
 ; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v30, 31, v6
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v36, 31, v5
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v34, 31, v4
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v33, v4
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v35, v5
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v29, v6
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v31, v7
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v7, 31, v1
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v5, 31, v0
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v4, v0
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v6, v1
-; GCN-GFX900-HSA-NEXT:    buffer_store_dword v25, off, s[20:23], 0 ; 4-byte Folded Spill
+; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(5)
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v36, 31, v11
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v34, 31, v10
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v40, 31, v9
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v38, 31, v8
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v37, v8
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v39, v9
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v33, v10
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v35, v11
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v11, 31, v5
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v9, 31, v4
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v8, v4
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v10, v5
+; GCN-GFX900-HSA-NEXT:    buffer_store_dword v29, off, s[20:23], 0 ; 4-byte Folded Spill
 ; GCN-GFX900-HSA-NEXT:    s_nop 0
-; GCN-GFX900-HSA-NEXT:    buffer_store_dword v26, off, s[20:23], 0 offset:4 ; 4-byte Folded Spill
-; GCN-GFX900-HSA-NEXT:    buffer_store_dword v27, off, s[20:23], 0 offset:8 ; 4-byte Folded Spill
-; GCN-GFX900-HSA-NEXT:    buffer_store_dword v28, off, s[20:23], 0 offset:12 ; 4-byte Folded Spill
+; GCN-GFX900-HSA-NEXT:    buffer_store_dword v30, off, s[20:23], 0 offset:4 ; 4-byte Folded Spill
+; GCN-GFX900-HSA-NEXT:    buffer_store_dword v31, off, s[20:23], 0 offset:8 ; 4-byte Folded Spill
+; GCN-GFX900-HSA-NEXT:    buffer_store_dword v32, off, s[20:23], 0 offset:12 ; 4-byte Folded Spill
 ; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(7)
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v28, 31, v12
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v26, 31, v11
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v40, 31, v10
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v38, 31, v9
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v37, v9
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v39, v10
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v25, v11
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v27, v12
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v7, 31, v16
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v5, 31, v15
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v48, 31, v14
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v46, 31, v13
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v45, v13
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v47, v14
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v4, v15
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v6, v16
 ; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(6)
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v12, 31, v16
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v10, 31, v15
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v44, 31, v14
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v42, 31, v13
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v41, v13
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v43, v14
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v9, v15
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v11, v16
-; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(5)
 ; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v16, 31, v20
 ; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v14, 31, v19
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v48, 31, v18
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v46, 31, v17
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v45, v17
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v47, v18
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v52, 31, v18
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v50, 31, v17
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v49, v17
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v51, v18
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v13, v19
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[49:52], v8, s[2:3] offset:16
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v15, v20
 ; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(5)
 ; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v20, 31, v24
@@ -3731,96 +3722,104 @@ define amdgpu_kernel void @global_sextload_v32i32_to_v32i64(ptr addrspace(1) %ou
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v55, v22
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v17, v23
 ; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v19, v24
-; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[21:24], v8, s[2:3]
+; GCN-GFX900-HSA-NEXT:    global_load_dwordx4 v[21:24], v12, s[2:3]
 ; GCN-GFX900-HSA-NEXT:    s_nop 0
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[33:36], s[0:1] offset:224
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[29:32], s[0:1] offset:240
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[4:7], s[0:1] offset:192
-; GCN-GFX900-HSA-NEXT:    buffer_load_dword v32, off, s[20:23], 0 ; 4-byte Folded Reload
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[37:40], s[0:1] offset:224
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[33:36], s[0:1] offset:240
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[8:11], s[0:1] offset:192
+; GCN-GFX900-HSA-NEXT:    buffer_load_dword v33, off, s[20:23], 0 ; 4-byte Folded Reload
 ; GCN-GFX900-HSA-NEXT:    s_nop 0
-; GCN-GFX900-HSA-NEXT:    buffer_load_dword v33, off, s[20:23], 0 offset:4 ; 4-byte Folded Reload
-; GCN-GFX900-HSA-NEXT:    buffer_load_dword v34, off, s[20:23], 0 offset:8 ; 4-byte Folded Reload
-; GCN-GFX900-HSA-NEXT:    buffer_load_dword v35, off, s[20:23], 0 offset:12 ; 4-byte Folded Reload
-; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(8)
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v60, 31, v52
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v58, 31, v51
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v3, 31, v50
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v1, 31, v49
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v0, v49
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v2, v50
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v57, v51
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v59, v52
+; GCN-GFX900-HSA-NEXT:    buffer_load_dword v34, off, s[20:23], 0 offset:4 ; 4-byte Folded Reload
+; GCN-GFX900-HSA-NEXT:    buffer_load_dword v35, off, s[20:23], 0 offset:8 ; 4-byte Folded Reload
+; GCN-GFX900-HSA-NEXT:    buffer_load_dword v36, off, s[20:23], 0 offset:12 ; 4-byte Folded Reload
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v32, 31, v28
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v30, 31, v27
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v44, 31, v26
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v42, 31, v25
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v41, v25
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v43, v26
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v29, v27
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v31, v28
+; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(12)
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v60, 31, v3
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v58, 31, v2
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v28, 31, v1
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v26, 31, v0
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v25, v0
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v27, v1
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v57, v2
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v59, v3
 ; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(7)
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v31, 31, v24
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v29, 31, v23
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v7, 31, v22
-; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v5, 31, v21
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v4, v21
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v6, v22
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v3, 31, v24
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v1, 31, v23
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v0, v23
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v2, v24
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v11, 31, v22
+; GCN-GFX900-HSA-NEXT:    v_ashrrev_i32_e32 v9, 31, v21
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v8, v21
+; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v10, v22
 ; GCN-GFX900-HSA-NEXT:    s_waitcnt vmcnt(0)
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[32:35], s[0:1] offset:208
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[37:40], s[0:1] offset:160
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[25:28], s[0:1] offset:176
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[41:44], s[0:1] offset:128
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[9:12], s[0:1] offset:144
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[45:48], s[0:1] offset:96
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[13:16], s[0:1] offset:112
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[53:56], s[0:1] offset:64
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[17:20], s[0:1] offset:80
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[0:3], s[0:1] offset:32
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[57:60], s[0:1] offset:48
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[4:7], s[0:1]
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v28, v23
-; GCN-GFX900-HSA-NEXT:    v_mov_b32_e32 v30, v24
-; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v8, v[28:31], s[0:1] offset:16
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[33:36], s[0:1] offset:208
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[41:44], s[0:1] offset:160
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[29:32], s[0:1] offset:176
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[45:48], s[0:1] offset:128
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[4:7], s[0:1] offset:144
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[49:52], s[0:1] offset:96
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[13:16], s[0:1] offset:112
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[53:56], s[0:1] offset:64
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[17:20], s[0:1] offset:80
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[25:28], s[0:1] offset:32
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[57:60], s[0:1] offset:48
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[8:11], s[0:1]
+; GCN-GFX900-HSA-NEXT:    global_store_dwordx4 v12, v[0:3], s[0:1] offset:16
 ; GCN-GFX900-HSA-NEXT:    s_endpgm
 ;
 ; GCN-GFX908-HSA-LABEL: global_sextload_v32i32_to_v32i64:
 ; GCN-GFX908-HSA:       ; %bb.0:
 ; GCN-GFX908-HSA-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v8, 0
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v12, 0
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[0:3], v8, s[2:3] offset:96
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[4:7], v8, s[2:3] offset:112
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[9:12], v8, s[2:3] offset:80
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[13:16], v8, s[2:3] offset:64
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[17:20], v8, s[2:3] offset:48
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[21:24], v8, s[2:3] offset:32
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[49:52], v8, s[2:3] offset:16
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[4:7], v12, s[2:3] offset:96
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[8:11], v12, s[2:3] offset:112
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[25:28], v12, s[2:3] offset:80
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[13:16], v12, s[2:3] offset:64
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[17:20], v12, s[2:3] offset:48
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[21:24], v12, s[2:3] offset:32
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[0:3], v12, s[2:3] offset:16
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(6)
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v25, v2
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v28, 31, v3
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v26, 31, v2
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v27, v3
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a0, v25
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a1, v26
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a2, v27
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a3, v28
-; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(4)
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v28, 31, v12
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v26, 31, v11
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v40, 31, v10
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v38, 31, v9
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v37, v9
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v39, v10
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v25, v11
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v27, v12
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v32, 31, v7
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v30, 31, v6
+; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(5)
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v36, 31, v11
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v34, 31, v10
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v40, 31, v9
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v38, 31, v8
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v37, v8
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v39, v9
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v33, v10
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v35, v11
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v11, 31, v5
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v9, 31, v4
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v8, v4
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v10, v5
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v29, v6
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v31, v7
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(3)
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v12, 31, v16
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v10, 31, v15
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v44, 31, v14
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v42, 31, v13
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v41, v13
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v43, v14
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v9, v15
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v11, v16
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v7, 31, v16
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v5, 31, v15
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v48, 31, v14
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v46, 31, v13
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v45, v13
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v47, v14
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v4, v15
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v6, v16
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(2)
 ; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v16, 31, v20
 ; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v14, 31, v19
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v48, 31, v18
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v46, 31, v17
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v45, v17
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v47, v18
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v52, 31, v18
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v50, 31, v17
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v49, v17
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v51, v18
 ; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v13, v19
 ; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v15, v20
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(1)
@@ -3832,57 +3831,57 @@ define amdgpu_kernel void @global_sextload_v32i32_to_v32i64(ptr addrspace(1) %ou
 ; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v55, v22
 ; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v17, v23
 ; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v19, v24
-; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[21:24], v8, s[2:3]
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v32, 31, v7
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v36, 31, v5
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v34, 31, v4
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v33, v4
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v35, v5
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v30, 31, v6
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v29, v6
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v31, v7
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[33:36], s[0:1] offset:224
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[29:32], s[0:1] offset:240
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v35, a3
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v7, 31, v1
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v5, 31, v0
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v4, v0
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v6, v1
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v34, a2
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v33, a1
-; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v32, a0
+; GCN-GFX908-HSA-NEXT:    global_load_dwordx4 v[21:24], v12, s[2:3]
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a0, v29
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a3, v32
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a1, v30
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_write_b32 a2, v31
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[37:40], s[0:1] offset:224
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[33:36], s[0:1] offset:240
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v32, 31, v28
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v36, a3
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v30, 31, v27
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v44, 31, v26
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v42, 31, v25
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v41, v25
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v43, v26
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v29, v27
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v31, v28
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(3)
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v60, 31, v52
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v58, 31, v51
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v3, 31, v50
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v1, 31, v49
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v0, v49
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v2, v50
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v57, v51
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v59, v52
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[4:7], s[0:1] offset:192
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v60, 31, v3
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v58, 31, v2
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v28, 31, v1
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v26, 31, v0
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v25, v0
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v27, v1
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v57, v2
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v59, v3
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v35, a2
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v34, a1
+; GCN-GFX908-HSA-NEXT:    v_accvgpr_read_b32 v33, a0
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[8:11], s[0:1] offset:192
 ; GCN-GFX908-HSA-NEXT:    s_waitcnt vmcnt(3)
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v31, 31, v24
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v29, 31, v23
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v7, 31, v22
-; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v5, 31, v21
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v4, v21
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v6, v22
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[32:35], s[0:1] offset:208
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[37:40], s[0:1] offset:160
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[25:28], s[0:1] offset:176
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[41:44], s[0:1] offset:128
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[9:12], s[0:1] offset:144
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[45:48], s[0:1] offset:96
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[13:16], s[0:1] offset:112
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[53:56], s[0:1] offset:64
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[17:20], s[0:1] offset:80
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[0:3], s[0:1] offset:32
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[57:60], s[0:1] offset:48
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[4:7], s[0:1]
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v28, v23
-; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v30, v24
-; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v8, v[28:31], s[0:1] offset:16
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v3, 31, v24
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v1, 31, v23
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v0, v23
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v2, v24
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v11, 31, v22
+; GCN-GFX908-HSA-NEXT:    v_ashrrev_i32_e32 v9, 31, v21
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v8, v21
+; GCN-GFX908-HSA-NEXT:    v_mov_b32_e32 v10, v22
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[33:36], s[0:1] offset:208
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[41:44], s[0:1] offset:160
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[29:32], s[0:1] offset:176
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[45:48], s[0:1] offset:128
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[4:7], s[0:1] offset:144
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[49:52], s[0:1] offset:96
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[13:16], s[0:1] offset:112
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[53:56], s[0:1] offset:64
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[17:20], s[0:1] offset:80
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[25:28], s[0:1] offset:32
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[57:60], s[0:1] offset:48
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[8:11], s[0:1]
+; GCN-GFX908-HSA-NEXT:    global_store_dwordx4 v12, v[0:3], s[0:1] offset:16
 ; GCN-GFX908-HSA-NEXT:    s_endpgm
   %ld = load <32 x i32>, ptr addrspace(1) %in
   %ext = sext <32 x i32> %ld to <32 x i64>
diff --git a/llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll b/llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll
index a6ce512164b89..8a3cc57e08579 100644
--- a/llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll
@@ -7,13 +7,13 @@
 define amdgpu_vs void @test(ptr addrspace(8) inreg %arg1, ptr addrspace(3) %arg2) {
 ; CHECK-LABEL: test:
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    v_add_i32_e32 v3, vcc, 12, v0
-; CHECK-NEXT:    v_add_i32_e32 v1, vcc, 8, v0
+; CHECK-NEXT:    v_add_i32_e32 v1, vcc, 12, v0
+; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 8, v0
 ; CHECK-NEXT:    v_add_i32_e32 v4, vcc, 4, v0
 ; CHECK-NEXT:    s_mov_b32 m0, -1
-; CHECK-NEXT:    ds_read_b32 v2, v1
+; CHECK-NEXT:    ds_read_b32 v3, v1
+; CHECK-NEXT:    ds_read_b32 v2, v2
 ; CHECK-NEXT:    ds_read_b32 v1, v4
-; CHECK-NEXT:    ds_read_b32 v3, v3
 ; CHECK-NEXT:    ds_read_b32 v0, v0
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    exp mrt0 off, off, off, off
@@ -69,36 +69,36 @@ define amdgpu_vs void @test_3(i32 inreg %arg1, i32 inreg %arg2, ptr addrspace(8)
 ; CHECK-NEXT:    v_add_i32_e32 v0, vcc, 12, v1
 ; CHECK-NEXT:    v_add_i32_e32 v3, vcc, 8, v1
 ; CHECK-NEXT:    v_add_i32_e32 v4, vcc, 4, v1
-; CHECK-NEXT:    v_add_i32_e32 v6, vcc, 20, v1
-; CHECK-NEXT:    v_add_i32_e32 v7, vcc, 16, v1
-; CHECK-NEXT:    v_mov_b32_e32 v9, s0
-; CHECK-NEXT:    v_add_i32_e32 v10, vcc, 12, v2
-; CHECK-NEXT:    v_add_i32_e32 v11, vcc, 8, v2
+; CHECK-NEXT:    v_add_i32_e32 v7, vcc, 20, v1
+; CHECK-NEXT:    v_add_i32_e32 v9, vcc, 16, v1
+; CHECK-NEXT:    v_mov_b32_e32 v10, s0
+; CHECK-NEXT:    v_add_i32_e32 v11, vcc, 12, v2
+; CHECK-NEXT:    v_add_i32_e32 v12, vcc, 8, v2
 ; CHECK-NEXT:    s_mov_b32 m0, -1
+; CHECK-NEXT:    ds_read_b32 v6, v0
 ; CHECK-NEXT:    ds_read_b32 v5, v3
 ; CHECK-NEXT:    ds_read_b32 v4, v4
-; CHECK-NEXT:    ds_read_b32 v8, v6
-; CHECK-NEXT:    ds_read_b32 v7, v7
-; CHECK-NEXT:    ds_read_b32 v6, v0
+; CHECK-NEXT:    ds_read_b32 v8, v7
+; CHECK-NEXT:    ds_read_b32 v7, v9
 ; CHECK-NEXT:    ds_read_b32 v3, v1
 ; CHECK-NEXT:    v_add_i32_e32 v0, vcc, 4, v2
 ; CHECK-NEXT:    v_add_i32_e32 v1, vcc, 20, v2
-; CHECK-NEXT:    v_add_i32_e32 v12, vcc, 16, v2
+; CHECK-NEXT:    v_add_i32_e32 v9, vcc, 16, v2
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
-; CHECK-NEXT:    tbuffer_store_format_xyzw v[3:6], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:264 glc slc
-; CHECK-NEXT:    tbuffer_store_format_xy v[7:8], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:280 glc slc
+; CHECK-NEXT:    tbuffer_store_format_xyzw v[3:6], v10, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:264 glc slc
+; CHECK-NEXT:    tbuffer_store_format_xy v[7:8], v10, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:280 glc slc
 ; CHECK-NEXT:    s_waitcnt expcnt(1)
-; CHECK-NEXT:    ds_read_b32 v4, v11
+; CHECK-NEXT:    ds_read_b32 v5, v11
+; CHECK-NEXT:    ds_read_b32 v4, v12
 ; CHECK-NEXT:    ds_read_b32 v3, v0
 ; CHECK-NEXT:    ds_read_b32 v1, v1
-; CHECK-NEXT:    ds_read_b32 v0, v12
-; CHECK-NEXT:    ds_read_b32 v5, v10
+; CHECK-NEXT:    ds_read_b32 v0, v9
 ; CHECK-NEXT:    ds_read_b32 v2, v2
-; CHECK-NEXT:    s_waitcnt lgkmcnt(2)
+; CHECK-NEXT:    s_waitcnt lgkmcnt(1)
 ; CHECK-NEXT:    exp mrt0 off, off, off, off
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
-; CHECK-NEXT:    tbuffer_store_format_xyzw v[2:5], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:240 glc slc
-; CHECK-NEXT:    tbuffer_store_format_xy v[0:1], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:256 glc slc
+; CHECK-NEXT:    tbuffer_store_format_xyzw v[2:5], v10, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:240 glc slc
+; CHECK-NEXT:    tbuffer_store_format_xy v[0:1], v10, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:256 glc slc
 ; CHECK-NEXT:    s_endpgm
   %load1 = load <6 x float>, ptr addrspace(3) %arg5, align 4
   %vec11 = shufflevector <6 x float> %load1, <6 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
diff --git a/llvm/test/CodeGen/AMDGPU/load-local.128.ll b/llvm/test/CodeGen/AMDGPU/load-local.128.ll
index 10dca76cc389a..d634e40f1d79b 100644
--- a/llvm/test/CodeGen/AMDGPU/load-local.128.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-local.128.ll
@@ -95,51 +95,51 @@ define <4 x i32> @load_lds_v4i32_align1(ptr addrspace(3) %ptr) {
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    s_mov_b32 m0, -1
-; GFX7-NEXT:    ds_read_u8 v1, v0 offset:6
-; GFX7-NEXT:    ds_read_u8 v2, v0 offset:4
-; GFX7-NEXT:    ds_read_u8 v3, v0 offset:2
-; GFX7-NEXT:    ds_read_u8 v4, v0 offset:1
+; GFX7-NEXT:    ds_read_u8 v1, v0 offset:1
+; GFX7-NEXT:    ds_read_u8 v2, v0 offset:6
+; GFX7-NEXT:    ds_read_u8 v3, v0 offset:4
+; GFX7-NEXT:    ds_read_u8 v4, v0 offset:2
 ; GFX7-NEXT:    ds_read_u8 v5, v0
 ; GFX7-NEXT:    ds_read_u8 v6, v0 offset:3
 ; GFX7-NEXT:    ds_read_u8 v7, v0 offset:5
 ; GFX7-NEXT:    ds_read_u8 v8, v0 offset:7
-; GFX7-NEXT:    s_waitcnt lgkmcnt(4)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
+; GFX7-NEXT:    s_waitcnt lgkmcnt(7)
+; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(3)
-; GFX7-NEXT:    v_or_b32_e32 v4, v4, v5
+; GFX7-NEXT:    v_or_b32_e32 v1, v1, v5
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(2)
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v5, 8, v6
-; GFX7-NEXT:    v_or_b32_e32 v3, v5, v3
-; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GFX7-NEXT:    v_or_b32_e32 v4, v3, v4
+; GFX7-NEXT:    v_or_b32_e32 v4, v5, v4
+; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GFX7-NEXT:    v_or_b32_e32 v4, v4, v1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(1)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 8, v7
-; GFX7-NEXT:    v_or_b32_e32 v2, v3, v2
+; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 8, v7
+; GFX7-NEXT:    v_or_b32_e32 v1, v1, v3
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 8, v8
-; GFX7-NEXT:    ds_read_u8 v5, v0 offset:15
-; GFX7-NEXT:    ds_read_u8 v6, v0 offset:14
+; GFX7-NEXT:    ds_read_u8 v5, v0 offset:9
+; GFX7-NEXT:    ds_read_u8 v6, v0 offset:11
 ; GFX7-NEXT:    ds_read_u8 v7, v0 offset:13
-; GFX7-NEXT:    ds_read_u8 v8, v0 offset:12
-; GFX7-NEXT:    ds_read_u8 v9, v0 offset:11
-; GFX7-NEXT:    ds_read_u8 v10, v0 offset:10
-; GFX7-NEXT:    ds_read_u8 v11, v0 offset:9
+; GFX7-NEXT:    ds_read_u8 v8, v0 offset:15
+; GFX7-NEXT:    ds_read_u8 v9, v0 offset:14
+; GFX7-NEXT:    ds_read_u8 v10, v0 offset:12
+; GFX7-NEXT:    ds_read_u8 v11, v0 offset:10
 ; GFX7-NEXT:    ds_read_u8 v0, v0 offset:8
-; GFX7-NEXT:    v_or_b32_e32 v1, v3, v1
-; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GFX7-NEXT:    v_or_b32_e32 v1, v1, v2
-; GFX7-NEXT:    s_waitcnt lgkmcnt(1)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v11
+; GFX7-NEXT:    v_or_b32_e32 v2, v3, v2
+; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
+; GFX7-NEXT:    s_waitcnt lgkmcnt(7)
+; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v5
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX7-NEXT:    v_or_b32_e32 v0, v2, v0
-; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v9
-; GFX7-NEXT:    v_or_b32_e32 v2, v2, v10
+; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v6
+; GFX7-NEXT:    v_or_b32_e32 v2, v2, v11
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 8, v5
+; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 8, v8
 ; GFX7-NEXT:    v_or_b32_e32 v2, v2, v0
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v0, 8, v7
-; GFX7-NEXT:    v_or_b32_e32 v3, v3, v6
-; GFX7-NEXT:    v_or_b32_e32 v0, v0, v8
+; GFX7-NEXT:    v_or_b32_e32 v3, v3, v9
+; GFX7-NEXT:    v_or_b32_e32 v0, v0, v10
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX7-NEXT:    v_or_b32_e32 v3, v3, v0
 ; GFX7-NEXT:    v_mov_b32_e32 v0, v4
@@ -331,21 +331,21 @@ define <4 x i32> @load_lds_v4i32_align2(ptr addrspace(3) %ptr) {
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    s_mov_b32 m0, -1
+; GFX7-NEXT:    ds_read_u16 v1, v0 offset:2
 ; GFX7-NEXT:    ds_read_u16 v3, v0 offset:12
 ; GFX7-NEXT:    ds_read_u16 v2, v0 offset:8
-; GFX7-NEXT:    ds_read_u16 v1, v0 offset:4
-; GFX7-NEXT:    ds_read_u16 v4, v0 offset:2
+; GFX7-NEXT:    ds_read_u16 v4, v0 offset:4
 ; GFX7-NEXT:    ds_read_u16 v5, v0
 ; GFX7-NEXT:    ds_read_u16 v6, v0 offset:6
 ; GFX7-NEXT:    ds_read_u16 v7, v0 offset:10
 ; GFX7-NEXT:    ds_read_u16 v8, v0 offset:14
-; GFX7-NEXT:    s_waitcnt lgkmcnt(4)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v0, 16, v4
+; GFX7-NEXT:    s_waitcnt lgkmcnt(7)
+; GFX7-NEXT:    v_lshlrev_b32_e32 v0, 16, v1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(3)
 ; GFX7-NEXT:    v_or_b32_e32 v0, v0, v5
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(2)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 16, v6
-; GFX7-NEXT:    v_or_b32_e32 v1, v4, v1
+; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v6
+; GFX7-NEXT:    v_or_b32_e32 v1, v1, v4
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(1)
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 16, v7
 ; GFX7-NEXT:    v_or_b32_e32 v2, v4, v2
diff --git a/llvm/test/CodeGen/AMDGPU/load-local.96.ll b/llvm/test/CodeGen/AMDGPU/load-local.96.ll
index 2da3fce72072e..b917b48b90e6a 100644
--- a/llvm/test/CodeGen/AMDGPU/load-local.96.ll
+++ b/llvm/test/CodeGen/AMDGPU/load-local.96.ll
@@ -86,41 +86,41 @@ define <3 x i32> @load_lds_v3i32_align1(ptr addrspace(3) %ptr) {
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    s_mov_b32 m0, -1
-; GFX7-NEXT:    ds_read_u8 v1, v0 offset:6
-; GFX7-NEXT:    ds_read_u8 v2, v0 offset:4
+; GFX7-NEXT:    ds_read_u8 v1, v0 offset:1
+; GFX7-NEXT:    ds_read_u8 v2, v0 offset:6
+; GFX7-NEXT:    ds_read_u8 v4, v0 offset:4
 ; GFX7-NEXT:    ds_read_u8 v3, v0 offset:2
-; GFX7-NEXT:    ds_read_u8 v4, v0 offset:1
 ; GFX7-NEXT:    ds_read_u8 v5, v0
 ; GFX7-NEXT:    ds_read_u8 v6, v0 offset:3
 ; GFX7-NEXT:    ds_read_u8 v7, v0 offset:5
 ; GFX7-NEXT:    ds_read_u8 v8, v0 offset:7
-; GFX7-NEXT:    s_waitcnt lgkmcnt(4)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
+; GFX7-NEXT:    s_waitcnt lgkmcnt(7)
+; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(3)
-; GFX7-NEXT:    v_or_b32_e32 v4, v4, v5
+; GFX7-NEXT:    v_or_b32_e32 v1, v1, v5
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(2)
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v5, 8, v6
 ; GFX7-NEXT:    v_or_b32_e32 v3, v5, v3
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
-; GFX7-NEXT:    v_or_b32_e32 v3, v3, v4
+; GFX7-NEXT:    v_or_b32_e32 v3, v3, v1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(1)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 8, v7
-; GFX7-NEXT:    ds_read_u8 v5, v0 offset:11
-; GFX7-NEXT:    ds_read_u8 v6, v0 offset:10
-; GFX7-NEXT:    ds_read_u8 v7, v0 offset:9
+; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 8, v7
+; GFX7-NEXT:    ds_read_u8 v5, v0 offset:9
+; GFX7-NEXT:    ds_read_u8 v6, v0 offset:11
+; GFX7-NEXT:    ds_read_u8 v7, v0 offset:10
 ; GFX7-NEXT:    ds_read_u8 v0, v0 offset:8
-; GFX7-NEXT:    v_or_b32_e32 v2, v4, v2
+; GFX7-NEXT:    v_or_b32_e32 v1, v1, v4
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(4)
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 8, v8
-; GFX7-NEXT:    v_or_b32_e32 v1, v4, v1
-; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GFX7-NEXT:    v_or_b32_e32 v1, v1, v2
-; GFX7-NEXT:    s_waitcnt lgkmcnt(1)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v7
+; GFX7-NEXT:    v_or_b32_e32 v2, v4, v2
+; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
+; GFX7-NEXT:    s_waitcnt lgkmcnt(3)
+; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v5
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX7-NEXT:    v_or_b32_e32 v0, v2, v0
-; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v5
-; GFX7-NEXT:    v_or_b32_e32 v2, v2, v6
+; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v6
+; GFX7-NEXT:    v_or_b32_e32 v2, v2, v7
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GFX7-NEXT:    v_or_b32_e32 v2, v2, v0
 ; GFX7-NEXT:    v_mov_b32_e32 v0, v3
@@ -274,19 +274,19 @@ define <3 x i32> @load_lds_v3i32_align2(ptr addrspace(3) %ptr) {
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    s_mov_b32 m0, -1
+; GFX7-NEXT:    ds_read_u16 v1, v0 offset:2
 ; GFX7-NEXT:    ds_read_u16 v2, v0 offset:8
-; GFX7-NEXT:    ds_read_u16 v1, v0 offset:4
-; GFX7-NEXT:    ds_read_u16 v3, v0 offset:2
+; GFX7-NEXT:    ds_read_u16 v3, v0 offset:4
 ; GFX7-NEXT:    ds_read_u16 v4, v0
 ; GFX7-NEXT:    ds_read_u16 v5, v0 offset:6
 ; GFX7-NEXT:    ds_read_u16 v6, v0 offset:10
-; GFX7-NEXT:    s_waitcnt lgkmcnt(3)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v0, 16, v3
+; GFX7-NEXT:    s_waitcnt lgkmcnt(5)
+; GFX7-NEXT:    v_lshlrev_b32_e32 v0, 16, v1
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(2)
 ; GFX7-NEXT:    v_or_b32_e32 v0, v0, v4
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(1)
-; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v5
-; GFX7-NEXT:    v_or_b32_e32 v1, v3, v1
+; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v5
+; GFX7-NEXT:    v_or_b32_e32 v1, v1, v3
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v6
 ; GFX7-NEXT:    v_or_b32_e32 v2, v3, v2
diff --git a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll
index 2b10d469acf5c..97db15ba637a5 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll
@@ -27,12 +27,12 @@ define amdgpu_kernel void @buffer_last_use_load_0(ptr addrspace(7) %in, ptr addr
 ; GFX12-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX12-NEXT:    s_mov_b32 s5, s12
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_mov_b32 s4, s3
-; GFX12-NEXT:    s_mov_b32 s3, s12
+; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX12-NEXT:    s_mov_b32 s13, s2
 ; GFX12-NEXT:    s_mov_b32 s2, s1
+; GFX12-NEXT:    s_mov_b32 s3, s12
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
@@ -69,12 +69,12 @@ define amdgpu_kernel void @buffer_last_use_load_1(ptr addrspace(7) %in, ptr addr
 ; GFX12-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX12-NEXT:    s_mov_b32 s5, s12
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_mov_b32 s4, s3
-; GFX12-NEXT:    s_mov_b32 s3, s12
+; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX12-NEXT:    s_mov_b32 s13, s2
 ; GFX12-NEXT:    s_mov_b32 s2, s1
+; GFX12-NEXT:    s_mov_b32 s3, s12
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
@@ -112,12 +112,12 @@ define amdgpu_kernel void @buffer_last_use_and_volatile_load(ptr addrspace(7) %i
 ; GFX12-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX12-NEXT:    s_mov_b32 s5, s12
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_mov_b32 s4, s3
-; GFX12-NEXT:    s_mov_b32 s3, s12
+; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX12-NEXT:    s_mov_b32 s13, s2
 ; GFX12-NEXT:    s_mov_b32 s2, s1
+; GFX12-NEXT:    s_mov_b32 s3, s12
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
@@ -153,12 +153,12 @@ define amdgpu_kernel void @buffer_last_use_and_nontemporal_load(ptr addrspace(7)
 ; GFX12-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX12-NEXT:    s_mov_b32 s5, s12
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_mov_b32 s4, s3
-; GFX12-NEXT:    s_mov_b32 s3, s12
+; GFX12-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX12-NEXT:    s_mov_b32 s13, s2
 ; GFX12-NEXT:    s_mov_b32 s2, s1
+; GFX12-NEXT:    s_mov_b32 s3, s12
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
diff --git a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll
index b4bbe849c08b9..10225bbeb7172 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll
@@ -218,12 +218,12 @@ define amdgpu_kernel void @buffer_nontemporal_load_store(ptr addrspace(7) %in, p
 ; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX11-SDAG-NEXT:    s_mov_b32 s5, s12
 ; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-SDAG-NEXT:    s_mov_b32 s4, s3
-; GFX11-SDAG-NEXT:    s_mov_b32 s3, s12
+; GFX11-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-SDAG-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX11-SDAG-NEXT:    s_mov_b32 s13, s2
 ; GFX11-SDAG-NEXT:    s_mov_b32 s2, s1
+; GFX11-SDAG-NEXT:    s_mov_b32 s3, s12
 ; GFX11-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-SDAG-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0)
@@ -253,12 +253,12 @@ define amdgpu_kernel void @buffer_nontemporal_load_store(ptr addrspace(7) %in, p
 ; GFX11-GISEL-NEXT:    s_load_b32 s7, s[4:5], 0x30
 ; GFX11-GISEL-NEXT:    s_mov_b32 s4, s9
 ; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-GISEL-NEXT:    s_mov_b32 s8, s1
 ; GFX11-GISEL-NEXT:    s_mov_b32 s5, s2
-; GFX11-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX11-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-GISEL-NEXT:    s_or_b64 s[4:5], s[8:9], s[4:5]
 ; GFX11-GISEL-NEXT:    s_mov_b32 s8, s3
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-GISEL-NEXT:    s_or_b64 s[6:7], s[8:9], s[6:7]
 ; GFX11-GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-GISEL-NEXT:    buffer_store_b32 v0, v1, s[4:7], 0 offen glc slc dlc
@@ -287,12 +287,12 @@ define amdgpu_kernel void @buffer_nontemporal_load_store(ptr addrspace(7) %in, p
 ; GFX12-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX12-SDAG-NEXT:    s_mov_b32 s5, s12
 ; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
-; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-SDAG-NEXT:    s_mov_b32 s4, s3
-; GFX12-SDAG-NEXT:    s_mov_b32 s3, s12
+; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-SDAG-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX12-SDAG-NEXT:    s_mov_b32 s13, s2
 ; GFX12-SDAG-NEXT:    s_mov_b32 s2, s1
+; GFX12-SDAG-NEXT:    s_mov_b32 s3, s12
 ; GFX12-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-SDAG-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX12-SDAG-NEXT:    s_wait_loadcnt 0x0
@@ -322,12 +322,12 @@ define amdgpu_kernel void @buffer_nontemporal_load_store(ptr addrspace(7) %in, p
 ; GFX12-GISEL-NEXT:    s_load_b32 s7, s[4:5], 0x30
 ; GFX12-GISEL-NEXT:    s_mov_b32 s4, s9
 ; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-GISEL-NEXT:    s_mov_b32 s8, s1
 ; GFX12-GISEL-NEXT:    s_mov_b32 s5, s2
-; GFX12-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX12-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-GISEL-NEXT:    s_or_b64 s[4:5], s[8:9], s[4:5]
 ; GFX12-GISEL-NEXT:    s_mov_b32 s8, s3
+; GFX12-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-GISEL-NEXT:    s_or_b64 s[6:7], s[8:9], s[6:7]
 ; GFX12-GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-GISEL-NEXT:    buffer_store_b32 v0, v1, s[4:7], null offen th:TH_STORE_NT
@@ -546,12 +546,12 @@ define amdgpu_kernel void @buffer_nontemporal_and_volatile_load_store(ptr addrsp
 ; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX11-SDAG-NEXT:    s_mov_b32 s5, s12
 ; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-SDAG-NEXT:    s_mov_b32 s4, s3
-; GFX11-SDAG-NEXT:    s_mov_b32 s3, s12
+; GFX11-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-SDAG-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX11-SDAG-NEXT:    s_mov_b32 s13, s2
 ; GFX11-SDAG-NEXT:    s_mov_b32 s2, s1
+; GFX11-SDAG-NEXT:    s_mov_b32 s3, s12
 ; GFX11-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-SDAG-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0)
@@ -581,12 +581,12 @@ define amdgpu_kernel void @buffer_nontemporal_and_volatile_load_store(ptr addrsp
 ; GFX11-GISEL-NEXT:    s_load_b32 s7, s[4:5], 0x30
 ; GFX11-GISEL-NEXT:    s_mov_b32 s4, s9
 ; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-GISEL-NEXT:    s_mov_b32 s8, s1
 ; GFX11-GISEL-NEXT:    s_mov_b32 s5, s2
-; GFX11-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX11-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX11-GISEL-NEXT:    s_or_b64 s[4:5], s[8:9], s[4:5]
 ; GFX11-GISEL-NEXT:    s_mov_b32 s8, s3
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-GISEL-NEXT:    s_or_b64 s[6:7], s[8:9], s[6:7]
 ; GFX11-GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-GISEL-NEXT:    buffer_store_b32 v0, v1, s[4:7], 0 offen dlc
@@ -615,12 +615,12 @@ define amdgpu_kernel void @buffer_nontemporal_and_volatile_load_store(ptr addrsp
 ; GFX12-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x20
 ; GFX12-SDAG-NEXT:    s_mov_b32 s5, s12
 ; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
-; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-SDAG-NEXT:    s_mov_b32 s4, s3
-; GFX12-SDAG-NEXT:    s_mov_b32 s3, s12
+; GFX12-SDAG-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-SDAG-NEXT:    s_or_b64 s[6:7], s[4:5], s[12:13]
 ; GFX12-SDAG-NEXT:    s_mov_b32 s13, s2
 ; GFX12-SDAG-NEXT:    s_mov_b32 s2, s1
+; GFX12-SDAG-NEXT:    s_mov_b32 s3, s12
 ; GFX12-SDAG-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-SDAG-NEXT:    s_or_b64 s[4:5], s[2:3], s[12:13]
 ; GFX12-SDAG-NEXT:    s_wait_loadcnt 0x0
@@ -650,12 +650,12 @@ define amdgpu_kernel void @buffer_nontemporal_and_volatile_load_store(ptr addrsp
 ; GFX12-GISEL-NEXT:    s_load_b32 s7, s[4:5], 0x30
 ; GFX12-GISEL-NEXT:    s_mov_b32 s4, s9
 ; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-GISEL-NEXT:    s_mov_b32 s8, s1
 ; GFX12-GISEL-NEXT:    s_mov_b32 s5, s2
-; GFX12-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX12-GISEL-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX12-GISEL-NEXT:    s_or_b64 s[4:5], s[8:9], s[4:5]
 ; GFX12-GISEL-NEXT:    s_mov_b32 s8, s3
+; GFX12-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-GISEL-NEXT:    s_or_b64 s[6:7], s[8:9], s[6:7]
 ; GFX12-GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-GISEL-NEXT:    buffer_store_b32 v0, v1, s[4:7], null offen th:TH_STORE_NT scope:SCOPE_SYS
diff --git a/llvm/test/CodeGen/AMDGPU/max.i16.ll b/llvm/test/CodeGen/AMDGPU/max.i16.ll
index 1857eaba0a2a9..1e246465ab1e3 100644
--- a/llvm/test/CodeGen/AMDGPU/max.i16.ll
+++ b/llvm/test/CodeGen/AMDGPU/max.i16.ll
@@ -139,17 +139,17 @@ define amdgpu_kernel void @v_test_imax_sge_v3i16(ptr addrspace(1) %out, ptr addr
 ;
 ; GFX9-LABEL: v_test_imax_sge_v3i16:
 ; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
 ; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX9-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX9-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX9-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-NEXT:    global_load_dword v3, v0, s[6:7]
-; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    global_load_short_d16 v2, v0, s[2:3] offset:4
 ; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    global_load_dword v3, v0, s[6:7]
 ; GFX9-NEXT:    global_load_dword v4, v0, s[2:3]
+; GFX9-NEXT:    ; kill: killed $sgpr2_sgpr3
 ; GFX9-NEXT:    s_nop 0
 ; GFX9-NEXT:    global_load_short_d16 v1, v0, s[6:7] offset:4
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
index 8d020b9e1a603..0003366f3a3ea 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
@@ -173,54 +173,53 @@ define amdgpu_kernel void @memcpy_p0_p5_minsize(ptr %generic, ptr addrspace(5) %
 ; CHECK-NEXT:    v_mov_b32_e32 v26, s0
 ; CHECK-NEXT:    buffer_load_dword v3, v26, s[20:23], 0 offen offset:124
 ; CHECK-NEXT:    buffer_load_dword v2, v26, s[20:23], 0 offen offset:120
+; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:100
+; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:108
 ; CHECK-NEXT:    buffer_load_dword v1, v26, s[20:23], 0 offen offset:116
 ; CHECK-NEXT:    buffer_load_dword v0, v26, s[20:23], 0 offen offset:112
-; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:108
 ; CHECK-NEXT:    buffer_load_dword v6, v26, s[20:23], 0 offen offset:104
-; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:100
 ; CHECK-NEXT:    buffer_load_dword v4, v26, s[20:23], 0 offen offset:96
 ; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; CHECK-NEXT:    buffer_load_dword v8, v26, s[20:23], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v9, v26, s[20:23], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v10, v26, s[20:23], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v11, v26, s[20:23], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v12, v26, s[20:23], 0 offen offset:32
-; CHECK-NEXT:    buffer_load_dword v13, v26, s[20:23], 0 offen offset:36
-; CHECK-NEXT:    buffer_load_dword v14, v26, s[20:23], 0 offen offset:40
-; CHECK-NEXT:    buffer_load_dword v15, v26, s[20:23], 0 offen offset:44
-; CHECK-NEXT:    buffer_load_dword v16, v26, s[20:23], 0 offen offset:48
-; CHECK-NEXT:    buffer_load_dword v17, v26, s[20:23], 0 offen offset:52
-; CHECK-NEXT:    buffer_load_dword v18, v26, s[20:23], 0 offen offset:56
-; CHECK-NEXT:    buffer_load_dword v19, v26, s[20:23], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v8, v26, s[20:23], 0 offen offset:32
+; CHECK-NEXT:    buffer_load_dword v9, v26, s[20:23], 0 offen offset:36
+; CHECK-NEXT:    buffer_load_dword v10, v26, s[20:23], 0 offen offset:40
+; CHECK-NEXT:    buffer_load_dword v11, v26, s[20:23], 0 offen offset:44
+; CHECK-NEXT:    buffer_load_dword v12, v26, s[20:23], 0 offen offset:48
+; CHECK-NEXT:    buffer_load_dword v13, v26, s[20:23], 0 offen offset:52
+; CHECK-NEXT:    buffer_load_dword v14, v26, s[20:23], 0 offen offset:56
+; CHECK-NEXT:    buffer_load_dword v15, v26, s[20:23], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v17, v26, s[20:23], 0 offen offset:68
+; CHECK-NEXT:    buffer_load_dword v19, v26, s[20:23], 0 offen offset:76
+; CHECK-NEXT:    buffer_load_dword v21, v26, s[20:23], 0 offen offset:84
 ; CHECK-NEXT:    buffer_load_dword v23, v26, s[20:23], 0 offen offset:92
 ; CHECK-NEXT:    buffer_load_dword v22, v26, s[20:23], 0 offen offset:88
-; CHECK-NEXT:    buffer_load_dword v21, v26, s[20:23], 0 offen offset:84
 ; CHECK-NEXT:    buffer_load_dword v20, v26, s[20:23], 0 offen offset:80
+; CHECK-NEXT:    buffer_load_dword v18, v26, s[20:23], 0 offen offset:72
+; CHECK-NEXT:    buffer_load_dword v16, v26, s[20:23], 0 offen offset:64
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    v_mov_b32_e32 v25, s1
 ; CHECK-NEXT:    v_mov_b32_e32 v24, s0
-; CHECK-NEXT:    s_waitcnt vmcnt(20)
+; CHECK-NEXT:    s_waitcnt vmcnt(18)
 ; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[0:3] offset:112
-; CHECK-NEXT:    buffer_load_dword v3, v26, s[20:23], 0 offen offset:76
-; CHECK-NEXT:    s_nop 0
-; CHECK-NEXT:    buffer_load_dword v2, v26, s[20:23], 0 offen offset:72
-; CHECK-NEXT:    buffer_load_dword v1, v26, s[20:23], 0 offen offset:68
-; CHECK-NEXT:    buffer_load_dword v0, v26, s[20:23], 0 offen offset:64
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[4:7] offset:96
-; CHECK-NEXT:    buffer_load_dword v4, v26, s[20:23], 0 offen
+; CHECK-NEXT:    buffer_load_dword v0, v26, s[20:23], 0 offen
+; CHECK-NEXT:    buffer_load_dword v1, v26, s[20:23], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v2, v26, s[20:23], 0 offen offset:8
 ; CHECK-NEXT:    s_nop 0
-; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v6, v26, s[20:23], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_dword v4, v26, s[20:23], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v6, v26, s[20:23], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v3, v26, s[20:23], 0 offen offset:12
 ; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[20:23] offset:80
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[0:3] offset:64
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[16:19] offset:48
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[12:15] offset:32
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[8:11] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[16:19] offset:64
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[12:15] offset:48
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[8:11] offset:32
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[4:7]
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[4:7] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[0:3]
 ; CHECK-NEXT:    s_endpgm
 entry:
   tail call void @llvm.memcpy.p0.p5.i64(ptr %generic, ptr addrspace(5) %src, i64 128, i1 false)
@@ -464,54 +463,53 @@ define amdgpu_kernel void @memcpy_p0_p5_optsize(ptr %generic, ptr addrspace(5) %
 ; CHECK-NEXT:    v_mov_b32_e32 v26, s0
 ; CHECK-NEXT:    buffer_load_dword v3, v26, s[20:23], 0 offen offset:124
 ; CHECK-NEXT:    buffer_load_dword v2, v26, s[20:23], 0 offen offset:120
+; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:100
+; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:108
 ; CHECK-NEXT:    buffer_load_dword v1, v26, s[20:23], 0 offen offset:116
 ; CHECK-NEXT:    buffer_load_dword v0, v26, s[20:23], 0 offen offset:112
-; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:108
 ; CHECK-NEXT:    buffer_load_dword v6, v26, s[20:23], 0 offen offset:104
-; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:100
 ; CHECK-NEXT:    buffer_load_dword v4, v26, s[20:23], 0 offen offset:96
 ; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; CHECK-NEXT:    buffer_load_dword v8, v26, s[20:23], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v9, v26, s[20:23], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v10, v26, s[20:23], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v11, v26, s[20:23], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v12, v26, s[20:23], 0 offen offset:32
-; CHECK-NEXT:    buffer_load_dword v13, v26, s[20:23], 0 offen offset:36
-; CHECK-NEXT:    buffer_load_dword v14, v26, s[20:23], 0 offen offset:40
-; CHECK-NEXT:    buffer_load_dword v15, v26, s[20:23], 0 offen offset:44
-; CHECK-NEXT:    buffer_load_dword v16, v26, s[20:23], 0 offen offset:48
-; CHECK-NEXT:    buffer_load_dword v17, v26, s[20:23], 0 offen offset:52
-; CHECK-NEXT:    buffer_load_dword v18, v26, s[20:23], 0 offen offset:56
-; CHECK-NEXT:    buffer_load_dword v19, v26, s[20:23], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v8, v26, s[20:23], 0 offen offset:32
+; CHECK-NEXT:    buffer_load_dword v9, v26, s[20:23], 0 offen offset:36
+; CHECK-NEXT:    buffer_load_dword v10, v26, s[20:23], 0 offen offset:40
+; CHECK-NEXT:    buffer_load_dword v11, v26, s[20:23], 0 offen offset:44
+; CHECK-NEXT:    buffer_load_dword v12, v26, s[20:23], 0 offen offset:48
+; CHECK-NEXT:    buffer_load_dword v13, v26, s[20:23], 0 offen offset:52
+; CHECK-NEXT:    buffer_load_dword v14, v26, s[20:23], 0 offen offset:56
+; CHECK-NEXT:    buffer_load_dword v15, v26, s[20:23], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v17, v26, s[20:23], 0 offen offset:68
+; CHECK-NEXT:    buffer_load_dword v19, v26, s[20:23], 0 offen offset:76
+; CHECK-NEXT:    buffer_load_dword v21, v26, s[20:23], 0 offen offset:84
 ; CHECK-NEXT:    buffer_load_dword v23, v26, s[20:23], 0 offen offset:92
 ; CHECK-NEXT:    buffer_load_dword v22, v26, s[20:23], 0 offen offset:88
-; CHECK-NEXT:    buffer_load_dword v21, v26, s[20:23], 0 offen offset:84
 ; CHECK-NEXT:    buffer_load_dword v20, v26, s[20:23], 0 offen offset:80
+; CHECK-NEXT:    buffer_load_dword v18, v26, s[20:23], 0 offen offset:72
+; CHECK-NEXT:    buffer_load_dword v16, v26, s[20:23], 0 offen offset:64
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    v_mov_b32_e32 v25, s1
 ; CHECK-NEXT:    v_mov_b32_e32 v24, s0
-; CHECK-NEXT:    s_waitcnt vmcnt(20)
+; CHECK-NEXT:    s_waitcnt vmcnt(18)
 ; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[0:3] offset:112
-; CHECK-NEXT:    buffer_load_dword v3, v26, s[20:23], 0 offen offset:76
-; CHECK-NEXT:    s_nop 0
-; CHECK-NEXT:    buffer_load_dword v2, v26, s[20:23], 0 offen offset:72
-; CHECK-NEXT:    buffer_load_dword v1, v26, s[20:23], 0 offen offset:68
-; CHECK-NEXT:    buffer_load_dword v0, v26, s[20:23], 0 offen offset:64
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[4:7] offset:96
-; CHECK-NEXT:    buffer_load_dword v4, v26, s[20:23], 0 offen
+; CHECK-NEXT:    buffer_load_dword v0, v26, s[20:23], 0 offen
+; CHECK-NEXT:    buffer_load_dword v1, v26, s[20:23], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v2, v26, s[20:23], 0 offen offset:8
 ; CHECK-NEXT:    s_nop 0
-; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v6, v26, s[20:23], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_dword v4, v26, s[20:23], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v5, v26, s[20:23], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v6, v26, s[20:23], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v7, v26, s[20:23], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v3, v26, s[20:23], 0 offen offset:12
 ; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[20:23] offset:80
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[0:3] offset:64
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[16:19] offset:48
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[12:15] offset:32
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[8:11] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[16:19] offset:64
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[12:15] offset:48
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[8:11] offset:32
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[4:7]
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[4:7] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[24:25], v[0:3]
 ; CHECK-NEXT:    s_endpgm
 entry:
   tail call void @llvm.memcpy.p0.p5.i64(ptr %generic, ptr addrspace(5) %src, i64 128, i1 false)
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll b/llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll
index cadc3dadb0a1e..b43ccc551ca95 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll
@@ -451,12 +451,12 @@ define void @memcpy_p0_p3_sz31_align_1_1(ptr addrspace(0) align 1 %dst, ptr addr
 ; CHECK-LABEL: memcpy_p0_p3_sz31_align_1_1:
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_b32 v8, v2 offset:24
+; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_u16 v10, v2 offset:28
 ; CHECK-NEXT:    ds_read_b64 v[6:7], v2 offset:16
 ; CHECK-NEXT:    ds_read2_b64 v[2:5], v2 offset1:1
-; CHECK-NEXT:    s_waitcnt lgkmcnt(4)
+; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_byte v[0:1], v9 offset:30
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
@@ -506,12 +506,12 @@ define void @memcpy_p0_p3_sz31_align_2_2(ptr addrspace(0) align 2 %dst, ptr addr
 ; CHECK-LABEL: memcpy_p0_p3_sz31_align_2_2:
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_b32 v8, v2 offset:24
+; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_u16 v10, v2 offset:28
 ; CHECK-NEXT:    ds_read_b64 v[6:7], v2 offset:16
 ; CHECK-NEXT:    ds_read2_b64 v[2:5], v2 offset1:1
-; CHECK-NEXT:    s_waitcnt lgkmcnt(4)
+; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_byte v[0:1], v9 offset:30
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
@@ -896,19 +896,18 @@ define void @memcpy_p0_p5_sz31_align_1_1(ptr addrspace(0) align 1 %dst, ptr addr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_byte v[0:1], v11 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    flat_store_short v[0:1], v11 offset:28
+; CHECK-NEXT:    flat_store_byte v[0:1], v10 offset:30
 ; CHECK-NEXT:    flat_store_dwordx3 v[0:1], v[7:9] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
@@ -924,18 +923,18 @@ define void @memcpy_p0_p5_sz32_align_1_1(ptr addrspace(0) align 1 %dst, ptr addr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x7
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6] offset:16
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10]
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -966,19 +965,18 @@ define void @memcpy_p0_p5_sz31_align_2_2(ptr addrspace(0) align 2 %dst, ptr addr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_byte v[0:1], v11 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    flat_store_short v[0:1], v11 offset:28
+; CHECK-NEXT:    flat_store_byte v[0:1], v10 offset:30
 ; CHECK-NEXT:    flat_store_dwordx3 v[0:1], v[7:9] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
@@ -994,18 +992,18 @@ define void @memcpy_p0_p5_sz32_align_2_2(ptr addrspace(0) align 2 %dst, ptr addr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x7
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6] offset:16
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10]
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
diff --git a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
index 0f1c1cf0d80af..9cc42ac448067 100644
--- a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
+++ b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
@@ -3583,104 +3583,102 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; CHECK-NEXT:  .LBB4_1: ; %load-store-loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3e
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:32
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:36
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:40
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:44
-; CHECK-NEXT:    buffer_load_dword v11, v2, s[0:3], 0 offen offset:48
-; CHECK-NEXT:    buffer_load_dword v12, v2, s[0:3], 0 offen offset:52
-; CHECK-NEXT:    buffer_load_dword v13, v2, s[0:3], 0 offen offset:56
-; CHECK-NEXT:    buffer_load_dword v14, v2, s[0:3], 0 offen offset:60
-; CHECK-NEXT:    buffer_load_dword v18, v2, s[0:3], 0 offen offset:124
-; CHECK-NEXT:    buffer_load_dword v17, v2, s[0:3], 0 offen offset:120
-; CHECK-NEXT:    buffer_load_dword v16, v2, s[0:3], 0 offen offset:116
-; CHECK-NEXT:    buffer_load_dword v15, v2, s[0:3], 0 offen offset:112
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:32
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:36
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:40
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:44
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:48
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:52
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:56
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v14, v2, s[0:3], 0 offen offset:76
+; CHECK-NEXT:    buffer_load_dword v18, v2, s[0:3], 0 offen offset:92
+; CHECK-NEXT:    buffer_load_dword v17, v2, s[0:3], 0 offen offset:88
+; CHECK-NEXT:    buffer_load_dword v16, v2, s[0:3], 0 offen offset:84
+; CHECK-NEXT:    buffer_load_dword v15, v2, s[0:3], 0 offen offset:80
+; CHECK-NEXT:    buffer_load_dword v13, v2, s[0:3], 0 offen offset:72
+; CHECK-NEXT:    buffer_load_dword v12, v2, s[0:3], 0 offen offset:68
+; CHECK-NEXT:    buffer_load_dword v11, v2, s[0:3], 0 offen offset:64
 ; CHECK-NEXT:    buffer_load_dword v22, v2, s[0:3], 0 offen offset:108
+; CHECK-NEXT:    buffer_load_dword v26, v2, s[0:3], 0 offen offset:124
+; CHECK-NEXT:    buffer_load_dword v25, v2, s[0:3], 0 offen offset:120
+; CHECK-NEXT:    buffer_load_dword v24, v2, s[0:3], 0 offen offset:116
+; CHECK-NEXT:    buffer_load_dword v23, v2, s[0:3], 0 offen offset:112
 ; CHECK-NEXT:    buffer_load_dword v21, v2, s[0:3], 0 offen offset:104
 ; CHECK-NEXT:    buffer_load_dword v20, v2, s[0:3], 0 offen offset:100
 ; CHECK-NEXT:    buffer_load_dword v19, v2, s[0:3], 0 offen offset:96
-; CHECK-NEXT:    buffer_load_dword v26, v2, s[0:3], 0 offen offset:252
-; CHECK-NEXT:    buffer_load_dword v25, v2, s[0:3], 0 offen offset:248
-; CHECK-NEXT:    buffer_load_dword v24, v2, s[0:3], 0 offen offset:244
-; CHECK-NEXT:    buffer_load_dword v23, v2, s[0:3], 0 offen offset:240
 ; CHECK-NEXT:    buffer_load_dword v30, v2, s[0:3], 0 offen offset:236
+; CHECK-NEXT:    buffer_load_dword v34, v2, s[0:3], 0 offen offset:252
+; CHECK-NEXT:    buffer_load_dword v33, v2, s[0:3], 0 offen offset:248
+; CHECK-NEXT:    buffer_load_dword v32, v2, s[0:3], 0 offen offset:244
+; CHECK-NEXT:    buffer_load_dword v31, v2, s[0:3], 0 offen offset:240
 ; CHECK-NEXT:    buffer_load_dword v29, v2, s[0:3], 0 offen offset:232
 ; CHECK-NEXT:    buffer_load_dword v28, v2, s[0:3], 0 offen offset:228
 ; CHECK-NEXT:    buffer_load_dword v27, v2, s[0:3], 0 offen offset:224
-; CHECK-NEXT:    buffer_load_dword v34, v2, s[0:3], 0 offen offset:220
-; CHECK-NEXT:    buffer_load_dword v33, v2, s[0:3], 0 offen offset:216
-; CHECK-NEXT:    buffer_load_dword v32, v2, s[0:3], 0 offen offset:212
-; CHECK-NEXT:    buffer_load_dword v31, v2, s[0:3], 0 offen offset:208
-; CHECK-NEXT:    buffer_load_dword v38, v2, s[0:3], 0 offen offset:204
-; CHECK-NEXT:    buffer_load_dword v37, v2, s[0:3], 0 offen offset:200
-; CHECK-NEXT:    buffer_load_dword v36, v2, s[0:3], 0 offen offset:196
-; CHECK-NEXT:    buffer_load_dword v35, v2, s[0:3], 0 offen offset:192
-; CHECK-NEXT:    buffer_load_dword v51, v2, s[0:3], 0 offen offset:188
-; CHECK-NEXT:    buffer_load_dword v50, v2, s[0:3], 0 offen offset:184
-; CHECK-NEXT:    buffer_load_dword v49, v2, s[0:3], 0 offen offset:180
-; CHECK-NEXT:    buffer_load_dword v48, v2, s[0:3], 0 offen offset:176
+; CHECK-NEXT:    buffer_load_dword v38, v2, s[0:3], 0 offen offset:220
+; CHECK-NEXT:    buffer_load_dword v37, v2, s[0:3], 0 offen offset:216
+; CHECK-NEXT:    buffer_load_dword v36, v2, s[0:3], 0 offen offset:212
+; CHECK-NEXT:    buffer_load_dword v35, v2, s[0:3], 0 offen offset:208
+; CHECK-NEXT:    buffer_load_dword v51, v2, s[0:3], 0 offen offset:204
+; CHECK-NEXT:    buffer_load_dword v50, v2, s[0:3], 0 offen offset:200
+; CHECK-NEXT:    buffer_load_dword v49, v2, s[0:3], 0 offen offset:196
+; CHECK-NEXT:    buffer_load_dword v48, v2, s[0:3], 0 offen offset:192
 ; CHECK-NEXT:    buffer_load_dword v55, v2, s[0:3], 0 offen offset:172
+; CHECK-NEXT:    buffer_load_dword v67, v2, s[0:3], 0 offen offset:188
+; CHECK-NEXT:    buffer_load_dword v66, v2, s[0:3], 0 offen offset:184
+; CHECK-NEXT:    buffer_load_dword v65, v2, s[0:3], 0 offen offset:180
+; CHECK-NEXT:    buffer_load_dword v64, v2, s[0:3], 0 offen offset:176
 ; CHECK-NEXT:    buffer_load_dword v54, v2, s[0:3], 0 offen offset:168
 ; CHECK-NEXT:    buffer_load_dword v53, v2, s[0:3], 0 offen offset:164
 ; CHECK-NEXT:    buffer_load_dword v52, v2, s[0:3], 0 offen offset:160
-; CHECK-NEXT:    buffer_load_dword v67, v2, s[0:3], 0 offen offset:156
-; CHECK-NEXT:    buffer_load_dword v66, v2, s[0:3], 0 offen offset:152
-; CHECK-NEXT:    buffer_load_dword v65, v2, s[0:3], 0 offen offset:148
-; CHECK-NEXT:    buffer_load_dword v64, v2, s[0:3], 0 offen offset:144
-; CHECK-NEXT:    buffer_load_dword v71, v2, s[0:3], 0 offen offset:140
-; CHECK-NEXT:    buffer_load_dword v70, v2, s[0:3], 0 offen offset:136
-; CHECK-NEXT:    buffer_load_dword v69, v2, s[0:3], 0 offen offset:132
-; CHECK-NEXT:    buffer_load_dword v68, v2, s[0:3], 0 offen offset:128
-; CHECK-NEXT:    buffer_load_dword v83, v2, s[0:3], 0 offen offset:92
-; CHECK-NEXT:    buffer_load_dword v82, v2, s[0:3], 0 offen offset:88
-; CHECK-NEXT:    buffer_load_dword v81, v2, s[0:3], 0 offen offset:84
-; CHECK-NEXT:    buffer_load_dword v80, v2, s[0:3], 0 offen offset:80
-; CHECK-NEXT:    buffer_load_dword v87, v2, s[0:3], 0 offen offset:76
-; CHECK-NEXT:    buffer_load_dword v86, v2, s[0:3], 0 offen offset:72
-; CHECK-NEXT:    buffer_load_dword v85, v2, s[0:3], 0 offen offset:68
-; CHECK-NEXT:    buffer_load_dword v84, v2, s[0:3], 0 offen offset:64
-; CHECK-NEXT:    buffer_load_dword v96, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v97, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v98, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v99, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_dword v71, v2, s[0:3], 0 offen offset:156
+; CHECK-NEXT:    buffer_load_dword v70, v2, s[0:3], 0 offen offset:152
+; CHECK-NEXT:    buffer_load_dword v69, v2, s[0:3], 0 offen offset:148
+; CHECK-NEXT:    buffer_load_dword v68, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT:    buffer_load_dword v83, v2, s[0:3], 0 offen offset:140
+; CHECK-NEXT:    buffer_load_dword v82, v2, s[0:3], 0 offen offset:136
+; CHECK-NEXT:    buffer_load_dword v81, v2, s[0:3], 0 offen offset:132
+; CHECK-NEXT:    buffer_load_dword v80, v2, s[0:3], 0 offen offset:128
+; CHECK-NEXT:    buffer_load_dword v84, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v85, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v86, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v96, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v97, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v98, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v99, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v87, v2, s[0:3], 0 offen offset:12
 ; CHECK-NEXT:    v_add_co_u32 v100, vcc_lo, v0, s4
 ; CHECK-NEXT:    s_add_u32 s4, s4, 0x100
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v101, null, s5, v1, vcc_lo
 ; CHECK-NEXT:    s_addc_u32 s5, s5, 0
 ; CHECK-NEXT:    v_add_nc_u32_e32 v2, 0x100, v2
 ; CHECK-NEXT:    v_cmp_gt_u64_e64 s6, 0x800, s[4:5]
-; CHECK-NEXT:    s_waitcnt vmcnt(41)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[23:26] offset:240
-; CHECK-NEXT:    s_waitcnt vmcnt(37)
+; CHECK-NEXT:    s_waitcnt vmcnt(35)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[31:34] offset:240
+; CHECK-NEXT:    s_waitcnt vmcnt(32)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[27:30] offset:224
-; CHECK-NEXT:    s_waitcnt vmcnt(33)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[31:34] offset:208
-; CHECK-NEXT:    s_waitcnt vmcnt(29)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[35:38] offset:192
-; CHECK-NEXT:    s_waitcnt vmcnt(25)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[48:51] offset:176
-; CHECK-NEXT:    s_waitcnt vmcnt(21)
+; CHECK-NEXT:    s_waitcnt vmcnt(28)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[35:38] offset:208
+; CHECK-NEXT:    s_waitcnt vmcnt(24)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[48:51] offset:192
+; CHECK-NEXT:    s_waitcnt vmcnt(19)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[64:67] offset:176
+; CHECK-NEXT:    s_waitcnt vmcnt(16)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[52:55] offset:160
-; CHECK-NEXT:    s_waitcnt vmcnt(17)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[64:67] offset:144
-; CHECK-NEXT:    s_waitcnt vmcnt(13)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[68:71] offset:128
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[15:18] offset:112
+; CHECK-NEXT:    s_waitcnt vmcnt(12)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[68:71] offset:144
+; CHECK-NEXT:    s_waitcnt vmcnt(8)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[80:83] offset:128
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[23:26] offset:112
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[19:22] offset:96
-; CHECK-NEXT:    s_waitcnt vmcnt(9)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[80:83] offset:80
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87] offset:64
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[11:14] offset:48
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[7:10] offset:32
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[15:18] offset:80
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[11:14] offset:64
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[7:10] offset:48
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[3:6] offset:32
 ; CHECK-NEXT:    s_waitcnt vmcnt(1)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[3:6] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87]
 ; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB4_1
 ; CHECK-NEXT:  ; %bb.2: ; %memcpy-split
@@ -3748,16 +3746,17 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v1, v2, s[0:3], 0 offen offset:21
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:22
 ; ALIGNED-NEXT:    buffer_load_ubyte v4, v2, s[0:3], 0 offen offset:23
-; ALIGNED-NEXT:    buffer_load_ubyte v7, v2, s[0:3], 0 offen offset:24
+; ALIGNED-NEXT:    buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:24
 ; ALIGNED-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:25
 ; ALIGNED-NEXT:    buffer_load_ubyte v12, v2, s[0:3], 0 offen offset:26
-; ALIGNED-NEXT:    buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:30
-; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:31
+; ALIGNED-NEXT:    buffer_load_ubyte v127, v2, s[0:3], 0 offen offset:19
+; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:28
+; ALIGNED-NEXT:    buffer_load_ubyte v7, v2, s[0:3], 0 offen offset:29
+; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:30
+; ALIGNED-NEXT:    buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:31
 ; ALIGNED-NEXT:    buffer_load_ubyte v14, v2, s[0:3], 0 offen offset:32
 ; ALIGNED-NEXT:    buffer_load_ubyte v15, v2, s[0:3], 0 offen offset:33
 ; ALIGNED-NEXT:    buffer_load_ubyte v17, v2, s[0:3], 0 offen offset:34
-; ALIGNED-NEXT:    buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:29
-; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:28
 ; ALIGNED-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:27
 ; ALIGNED-NEXT:    buffer_load_ubyte v19, v2, s[0:3], 0 offen offset:35
 ; ALIGNED-NEXT:    buffer_load_ubyte v13, v2, s[0:3], 0 offen offset:36
@@ -3798,10 +3797,9 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v68, v2, s[0:3], 0 offen offset:71
 ; ALIGNED-NEXT:    buffer_load_ubyte v69, v2, s[0:3], 0 offen offset:76
 ; ALIGNED-NEXT:    buffer_load_ubyte v70, v2, s[0:3], 0 offen offset:77
+; ALIGNED-NEXT:    buffer_load_ubyte v81, v2, s[0:3], 0 offen offset:75
 ; ALIGNED-NEXT:    buffer_load_ubyte v71, v2, s[0:3], 0 offen offset:78
 ; ALIGNED-NEXT:    buffer_load_ubyte v80, v2, s[0:3], 0 offen offset:79
-; ALIGNED-NEXT:    buffer_load_ubyte v127, v2, s[0:3], 0 offen offset:19
-; ALIGNED-NEXT:    buffer_load_ubyte v81, v2, s[0:3], 0 offen offset:75
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(57)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:448 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(56)
@@ -3811,46 +3809,46 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(54)
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:460 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(53)
-; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:468 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(52)
 ; ALIGNED-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:484 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(51)
 ; ALIGNED-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:492 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(50)
-; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(49)
-; ALIGNED-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(48)
+; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(47)
+; ALIGNED-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:476 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(46)
+; ALIGNED-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:480 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(45)
 ; ALIGNED-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:504 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v4, 8, v3
-; ALIGNED-NEXT:    s_waitcnt vmcnt(45)
-; ALIGNED-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:472 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(44)
-; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:464 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(43)
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v7, 8, v5
+; ALIGNED-NEXT:    s_waitcnt vmcnt(42)
 ; ALIGNED-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:488 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v9, 8, v5
-; ALIGNED-NEXT:    s_waitcnt vmcnt(41)
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v9, 8, v8
+; ALIGNED-NEXT:    s_waitcnt vmcnt(40)
 ; ALIGNED-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:496 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v4, v8, 8, v6
-; ALIGNED-NEXT:    v_lshl_or_b32 v5, v10, 8, v7
+; ALIGNED-NEXT:    v_lshl_or_b32 v5, v10, 8, v6
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v11, 8, v12
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v15, 8, v14
 ; ALIGNED-NEXT:    v_lshl_or_b32 v8, v19, 8, v17
-; ALIGNED-NEXT:    s_waitcnt vmcnt(40)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(39)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v9, v16, 8, v13
-; ALIGNED-NEXT:    s_waitcnt vmcnt(38)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(37)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v10, v20, 8, v18
-; ALIGNED-NEXT:    s_waitcnt vmcnt(36)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(35)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v11, v23, 8, v22
-; ALIGNED-NEXT:    s_waitcnt vmcnt(34)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(33)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v12, v28, 8, v25
-; ALIGNED-NEXT:    s_waitcnt vmcnt(32)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(31)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v13, v24, 8, v21
-; ALIGNED-NEXT:    s_waitcnt vmcnt(30)
-; ALIGNED-NEXT:    v_lshl_or_b32 v14, v27, 8, v26
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
+; ALIGNED-NEXT:    s_waitcnt vmcnt(29)
+; ALIGNED-NEXT:    v_lshl_or_b32 v14, v27, 8, v26
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v6, 16, v5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v8, 16, v7
@@ -3858,27 +3856,27 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v12, 16, v11
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v14, 16, v13
 ; ALIGNED-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:508 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(28)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(27)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v15, v31, 8, v30
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:516 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(26)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(25)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v34, 8, v33
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:532 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(24)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(23)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v37, 8, v32
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:536 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(22)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(21)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v36, 8, v35
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:576 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(17)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(16)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v50, 8, v38
 ; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:588 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(15)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(14)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v5, v49, 8, v39
 ; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:604 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v51, 8, v48
 ; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:616 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(11)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(10)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v53, 8, v52
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v0, 16, v15
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v3, 16, v1
@@ -3888,13 +3886,13 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:652 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v55, 8, v29
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:656 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(11)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(10)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v67, 8, v66
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:664 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(9)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v64, 8, v54
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:668 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v68, 8, v65
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_clause 0x1
@@ -3903,13 +3901,13 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:524 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v4, 16, v3
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v70, 8, v69
 ; ALIGNED-NEXT:    s_clause 0x1
 ; ALIGNED-NEXT:    buffer_load_ubyte v4, v2, s[0:3], 0 offen offset:83
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:74
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(5)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v80, 8, v71
 ; ALIGNED-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:528 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:500 ; 4-byte Folded Spill
@@ -3955,9 +3953,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v70, off, s[0:3], s32 offset:704 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v71, off, s[0:3], s32 offset:708 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v80, off, s[0:3], s32 offset:716 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
 ; ALIGNED-NEXT:    buffer_store_dword v127, off, s[0:3], s32 offset:1152 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
 ; ALIGNED-NEXT:    buffer_store_dword v81, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:87
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
@@ -4240,7 +4236,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:148
 ; ALIGNED-NEXT:    buffer_load_ubyte v4, v2, s[0:3], 0 offen offset:149
 ; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:150
-; ALIGNED-NEXT:    buffer_load_ubyte v123, v2, s[0:3], 0 offen offset:151
+; ALIGNED-NEXT:    buffer_load_ubyte v124, v2, s[0:3], 0 offen offset:151
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v126, 8, v125
@@ -4252,7 +4248,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:1116 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v123, 8, v5
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v124, 8, v5
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1124 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v4, 8, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
@@ -4286,7 +4282,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v74, v2, s[0:3], 0 offen offset:163
 ; ALIGNED-NEXT:    buffer_load_ubyte v79, v2, s[0:3], 0 offen offset:164
 ; ALIGNED-NEXT:    buffer_load_ubyte v75, v2, s[0:3], 0 offen offset:165
-; ALIGNED-NEXT:    buffer_load_ubyte v76, v2, s[0:3], 0 offen offset:166
+; ALIGNED-NEXT:    buffer_load_ubyte v77, v2, s[0:3], 0 offen offset:166
 ; ALIGNED-NEXT:    buffer_load_ubyte v72, v2, s[0:3], 0 offen offset:167
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v78, 8, v89
@@ -4294,7 +4290,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v74, 8, v73
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v72, 8, v76
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v72, 8, v77
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1156 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v75, 8, v79
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
@@ -4303,20 +4299,20 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v63, v2, s[0:3], 0 offen offset:172
 ; ALIGNED-NEXT:    buffer_load_ubyte v61, v2, s[0:3], 0 offen offset:173
 ; ALIGNED-NEXT:    buffer_load_ubyte v62, v2, s[0:3], 0 offen offset:174
-; ALIGNED-NEXT:    buffer_load_ubyte v60, v2, s[0:3], 0 offen offset:175
+; ALIGNED-NEXT:    buffer_load_ubyte v59, v2, s[0:3], 0 offen offset:175
 ; ALIGNED-NEXT:    buffer_load_ubyte v57, v2, s[0:3], 0 offen offset:171
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v61, 8, v63
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v60, 8, v62
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v59, 8, v62
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1164 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
-; ALIGNED-NEXT:    buffer_load_ubyte v59, v2, s[0:3], 0 offen offset:168
+; ALIGNED-NEXT:    buffer_load_ubyte v58, v2, s[0:3], 0 offen offset:168
 ; ALIGNED-NEXT:    buffer_load_ubyte v56, v2, s[0:3], 0 offen offset:169
 ; ALIGNED-NEXT:    buffer_load_ubyte v47, v2, s[0:3], 0 offen offset:170
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v56, 8, v59
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v56, 8, v58
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v57, 8, v47
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
@@ -4326,7 +4322,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v43, v2, s[0:3], 0 offen offset:177
 ; ALIGNED-NEXT:    buffer_load_ubyte v119, v2, s[0:3], 0 offen offset:178
 ; ALIGNED-NEXT:    buffer_load_ubyte v40, v2, s[0:3], 0 offen offset:179
-; ALIGNED-NEXT:    buffer_load_ubyte v45, v2, s[0:3], 0 offen offset:180
+; ALIGNED-NEXT:    buffer_load_ubyte v44, v2, s[0:3], 0 offen offset:180
 ; ALIGNED-NEXT:    buffer_load_ubyte v41, v2, s[0:3], 0 offen offset:181
 ; ALIGNED-NEXT:    buffer_load_ubyte v42, v2, s[0:3], 0 offen offset:182
 ; ALIGNED-NEXT:    buffer_load_ubyte v118, v2, s[0:3], 0 offen offset:183
@@ -4338,7 +4334,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v118, 8, v42
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1172 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v41, 8, v45
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v41, 8, v44
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1176 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
@@ -4373,15 +4369,16 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v96, v2, s[0:3], 0 offen offset:198
 ; ALIGNED-NEXT:    buffer_load_ubyte v85, v2, s[0:3], 0 offen offset:199
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v98, 8, v100
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v98, 8, v100
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v86, 8, v87
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v3, 16, v0
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v86, 8, v87
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v4, 16, v3
+; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v97, 8, v99
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v85, 8, v96
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v85, 8, v96
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1188 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v97, 8, v99
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v3, 16, v0
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v4, 16, v3
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1192 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
 ; ALIGNED-NEXT:    buffer_load_ubyte v83, v2, s[0:3], 0 offen offset:204
@@ -4492,23 +4489,23 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:248
 ; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:249
 ; ALIGNED-NEXT:    buffer_load_ubyte v1, v2, s[0:3], 0 offen offset:250
-; ALIGNED-NEXT:    v_lshl_or_b32 v124, v4, 16, v3
+; ALIGNED-NEXT:    v_lshl_or_b32 v123, v4, 16, v3
 ; ALIGNED-NEXT:    s_clause 0x5
 ; ALIGNED-NEXT:    buffer_load_ubyte v0, v2, s[0:3], 0 offen
 ; ALIGNED-NEXT:    buffer_load_ubyte v94, v2, s[0:3], 0 offen offset:2
 ; ALIGNED-NEXT:    buffer_load_ubyte v88, v2, s[0:3], 0 offen offset:4
 ; ALIGNED-NEXT:    buffer_load_ubyte v90, v2, s[0:3], 0 offen offset:5
 ; ALIGNED-NEXT:    buffer_load_ubyte v92, v2, s[0:3], 0 offen offset:6
-; ALIGNED-NEXT:    buffer_load_ubyte v95, v2, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT:    buffer_load_ubyte v104, v2, s[0:3], 0 offen offset:7
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(28)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v25, 8, v27
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(26)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v24, 8, v26
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(14)
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v12, 8, v16
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v12, 8, v16
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(10)
-; ALIGNED-NEXT:    v_lshl_or_b32 v58, v8, 8, v10
-; ALIGNED-NEXT:    v_lshl_or_b32 v104, v4, 16, v3
+; ALIGNED-NEXT:    v_lshl_or_b32 v60, v8, 8, v10
+; ALIGNED-NEXT:    v_lshl_or_b32 v95, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v21, 8, v22
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v23, 8, v20
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
@@ -4517,35 +4514,35 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v90, off, s[0:3], s32 offset:1096 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
 ; ALIGNED-NEXT:    buffer_store_dword v92, off, s[0:3], s32 offset:1100 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v77, v4, 16, v3
+; ALIGNED-NEXT:    v_lshl_or_b32 v76, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v17, 8, v19
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v14, 8, v13
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    buffer_store_dword v95, off, s[0:3], s32 offset:1108 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v104, off, s[0:3], s32 offset:1108 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v101, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v15, 8, v18
-; ALIGNED-NEXT:    v_lshl_or_b32 v84, v44, 16, v4
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v9, 8, v11
-; ALIGNED-NEXT:    v_lshl_or_b32 v4, v58, 16, v44
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v5, 8, v6
-; ALIGNED-NEXT:    v_lshl_or_b32 v58, v7, 8, v1
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v58, 16, v44
+; ALIGNED-NEXT:    v_lshl_or_b32 v84, v45, 16, v4
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v9, 8, v11
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v60, 16, v45
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v5, 8, v6
+; ALIGNED-NEXT:    v_lshl_or_b32 v60, v7, 8, v1
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v60, 16, v45
 ; ALIGNED-NEXT:    s_clause 0x1
-; ALIGNED-NEXT:    buffer_load_ubyte v44, v2, s[0:3], 0 offen offset:1
-; ALIGNED-NEXT:    buffer_load_ubyte v58, v2, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT:    buffer_load_ubyte v45, v2, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT:    buffer_load_ubyte v60, v2, s[0:3], 0 offen offset:3
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1068 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v94, off, s[0:3], s32 offset:1092 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    buffer_store_dword v44, off, s[0:3], s32 offset:1076 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:1076 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    buffer_store_dword v58, off, s[0:3], s32 offset:1080 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v44, 8, v0
-; ALIGNED-NEXT:    v_lshl_or_b32 v58, v58, 8, v94
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v58, 16, v44
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v90, 8, v88
-; ALIGNED-NEXT:    v_lshl_or_b32 v58, v95, 8, v92
+; ALIGNED-NEXT:    buffer_store_dword v60, off, s[0:3], s32 offset:1080 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v45, 8, v0
+; ALIGNED-NEXT:    v_lshl_or_b32 v60, v60, 8, v94
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v60, 16, v45
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v90, 8, v88
+; ALIGNED-NEXT:    v_lshl_or_b32 v60, v104, 8, v92
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1120 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v58, 16, v44
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v60, 16, v45
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1128 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
 ; ALIGNED-NEXT:    buffer_load_ubyte v122, v2, s[0:3], 0 offen offset:12
@@ -4554,34 +4551,34 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_ubyte v110, v2, s[0:3], 0 offen offset:15
 ; ALIGNED-NEXT:    buffer_load_ubyte v94, v2, s[0:3], 0 offen offset:11
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v111, 8, v122
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v111, 8, v122
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v58, v110, 8, v120
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v58, 16, v44
+; ALIGNED-NEXT:    v_lshl_or_b32 v60, v110, 8, v120
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v60, 16, v45
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1140 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
-; ALIGNED-NEXT:    buffer_load_ubyte v95, v2, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT:    buffer_load_ubyte v104, v2, s[0:3], 0 offen offset:8
 ; ALIGNED-NEXT:    buffer_load_ubyte v92, v2, s[0:3], 0 offen offset:9
 ; ALIGNED-NEXT:    buffer_load_ubyte v90, v2, s[0:3], 0 offen offset:10
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v44, v92, 8, v95
+; ALIGNED-NEXT:    v_lshl_or_b32 v45, v92, 8, v104
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v58, v94, 8, v90
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v58, 16, v44
+; ALIGNED-NEXT:    v_lshl_or_b32 v60, v94, 8, v90
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v60, 16, v45
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1148 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
+; ALIGNED-NEXT:    buffer_load_ubyte v60, v2, s[0:3], 0 offen offset:18
 ; ALIGNED-NEXT:    buffer_load_ubyte v88, v2, s[0:3], 0 offen offset:16
-; ALIGNED-NEXT:    buffer_load_ubyte v44, v2, s[0:3], 0 offen offset:18
-; ALIGNED-NEXT:    buffer_load_ubyte v58, v2, s[0:3], 0 offen offset:17
+; ALIGNED-NEXT:    buffer_load_ubyte v45, v2, s[0:3], 0 offen offset:17
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:232
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:236
 ; ALIGNED-NEXT:    buffer_store_dword v84, off, s[0:3], s32 offset:228
 ; ALIGNED-NEXT:    buffer_store_dword v101, off, s[0:3], s32 offset:224
 ; ALIGNED-NEXT:    v_add_nc_u32_e32 v2, 0x100, v2
-; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v127, 8, v44
+; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v127, 8, v60
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v127, v58, 8, v88
+; ALIGNED-NEXT:    v_lshl_or_b32 v127, v45, 8, v88
 ; ALIGNED-NEXT:    v_lshl_or_b32 v127, v0, 16, v127
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1228 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
@@ -4606,9 +4603,9 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v16 offset:246
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v18 offset:244
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v19 offset:240
-; ALIGNED-NEXT:    buffer_store_dword v77, off, s[0:3], s32 offset:248
-; ALIGNED-NEXT:    buffer_store_dword v104, off, s[0:3], s32 offset:252
-; ALIGNED-NEXT:    buffer_store_dword v124, off, s[0:3], s32 offset:244
+; ALIGNED-NEXT:    buffer_store_dword v76, off, s[0:3], s32 offset:248
+; ALIGNED-NEXT:    buffer_store_dword v95, off, s[0:3], s32 offset:252
+; ALIGNED-NEXT:    buffer_store_dword v123, off, s[0:3], s32 offset:244
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1220 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_addc_u32 s5, s5, 0
 ; ALIGNED-NEXT:    v_cmp_gt_u64_e64 s6, 0x800, s[4:5]
@@ -4713,7 +4710,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v118 offset:183
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v41 offset:181
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v42 offset:182
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v45 offset:180
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v44 offset:180
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v46 offset:176
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1168 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
@@ -4730,17 +4727,17 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v47 offset:170
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v57 offset:171
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v56 offset:169
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v60 offset:175
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v59 offset:175
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v61 offset:173
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v62 offset:174
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v63 offset:172
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v59 offset:168
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v58 offset:168
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v73 offset:162
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v74 offset:163
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v78 offset:161
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v72 offset:167
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v75 offset:165
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v76 offset:166
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v77 offset:166
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v79 offset:164
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v89 offset:160
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1144 ; 4-byte Folded Reload
@@ -4768,7 +4765,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1084 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:145
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v123 offset:151
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v124 offset:151
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1112 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:149
@@ -5235,11 +5232,11 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:468 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:24
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v44 offset:18
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v60 offset:18
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1152 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:19
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v58 offset:17
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v45 offset:17
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:460 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:23
@@ -5272,7 +5269,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v110 offset:15
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v120 offset:14
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v122 offset:12
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v95 offset:8
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v104 offset:8
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1092 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:2
@@ -12461,97 +12458,97 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; CHECK-NEXT:  .LBB9_1: ; %memmove_fwd_loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3e
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:32
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:36
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:40
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:44
-; CHECK-NEXT:    buffer_load_dword v11, v2, s[0:3], 0 offen offset:48
-; CHECK-NEXT:    buffer_load_dword v12, v2, s[0:3], 0 offen offset:52
-; CHECK-NEXT:    buffer_load_dword v13, v2, s[0:3], 0 offen offset:56
-; CHECK-NEXT:    buffer_load_dword v14, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:32
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:36
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:40
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:44
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:48
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:52
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:56
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v14, v2, s[0:3], 0 offen offset:76
 ; CHECK-NEXT:    buffer_load_dword v18, v2, s[0:3], 0 offen offset:92
 ; CHECK-NEXT:    buffer_load_dword v17, v2, s[0:3], 0 offen offset:88
 ; CHECK-NEXT:    buffer_load_dword v16, v2, s[0:3], 0 offen offset:84
 ; CHECK-NEXT:    buffer_load_dword v15, v2, s[0:3], 0 offen offset:80
-; CHECK-NEXT:    buffer_load_dword v22, v2, s[0:3], 0 offen offset:124
-; CHECK-NEXT:    buffer_load_dword v21, v2, s[0:3], 0 offen offset:120
-; CHECK-NEXT:    buffer_load_dword v20, v2, s[0:3], 0 offen offset:116
-; CHECK-NEXT:    buffer_load_dword v19, v2, s[0:3], 0 offen offset:112
-; CHECK-NEXT:    buffer_load_dword v26, v2, s[0:3], 0 offen offset:108
-; CHECK-NEXT:    buffer_load_dword v25, v2, s[0:3], 0 offen offset:104
-; CHECK-NEXT:    buffer_load_dword v24, v2, s[0:3], 0 offen offset:100
-; CHECK-NEXT:    buffer_load_dword v23, v2, s[0:3], 0 offen offset:96
-; CHECK-NEXT:    buffer_load_dword v30, v2, s[0:3], 0 offen offset:156
-; CHECK-NEXT:    buffer_load_dword v29, v2, s[0:3], 0 offen offset:152
-; CHECK-NEXT:    buffer_load_dword v28, v2, s[0:3], 0 offen offset:148
-; CHECK-NEXT:    buffer_load_dword v27, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT:    buffer_load_dword v13, v2, s[0:3], 0 offen offset:72
+; CHECK-NEXT:    buffer_load_dword v12, v2, s[0:3], 0 offen offset:68
+; CHECK-NEXT:    buffer_load_dword v11, v2, s[0:3], 0 offen offset:64
+; CHECK-NEXT:    buffer_load_dword v22, v2, s[0:3], 0 offen offset:108
+; CHECK-NEXT:    buffer_load_dword v26, v2, s[0:3], 0 offen offset:124
+; CHECK-NEXT:    buffer_load_dword v25, v2, s[0:3], 0 offen offset:120
+; CHECK-NEXT:    buffer_load_dword v24, v2, s[0:3], 0 offen offset:116
+; CHECK-NEXT:    buffer_load_dword v23, v2, s[0:3], 0 offen offset:112
+; CHECK-NEXT:    buffer_load_dword v21, v2, s[0:3], 0 offen offset:104
+; CHECK-NEXT:    buffer_load_dword v20, v2, s[0:3], 0 offen offset:100
+; CHECK-NEXT:    buffer_load_dword v19, v2, s[0:3], 0 offen offset:96
+; CHECK-NEXT:    buffer_load_dword v30, v2, s[0:3], 0 offen offset:172
 ; CHECK-NEXT:    buffer_load_dword v34, v2, s[0:3], 0 offen offset:188
 ; CHECK-NEXT:    buffer_load_dword v33, v2, s[0:3], 0 offen offset:184
 ; CHECK-NEXT:    buffer_load_dword v32, v2, s[0:3], 0 offen offset:180
 ; CHECK-NEXT:    buffer_load_dword v31, v2, s[0:3], 0 offen offset:176
-; CHECK-NEXT:    buffer_load_dword v38, v2, s[0:3], 0 offen offset:172
-; CHECK-NEXT:    buffer_load_dword v37, v2, s[0:3], 0 offen offset:168
-; CHECK-NEXT:    buffer_load_dword v36, v2, s[0:3], 0 offen offset:164
-; CHECK-NEXT:    buffer_load_dword v35, v2, s[0:3], 0 offen offset:160
+; CHECK-NEXT:    buffer_load_dword v29, v2, s[0:3], 0 offen offset:168
+; CHECK-NEXT:    buffer_load_dword v28, v2, s[0:3], 0 offen offset:164
+; CHECK-NEXT:    buffer_load_dword v27, v2, s[0:3], 0 offen offset:160
+; CHECK-NEXT:    buffer_load_dword v38, v2, s[0:3], 0 offen offset:204
 ; CHECK-NEXT:    buffer_load_dword v51, v2, s[0:3], 0 offen offset:220
 ; CHECK-NEXT:    buffer_load_dword v50, v2, s[0:3], 0 offen offset:216
 ; CHECK-NEXT:    buffer_load_dword v49, v2, s[0:3], 0 offen offset:212
 ; CHECK-NEXT:    buffer_load_dword v48, v2, s[0:3], 0 offen offset:208
-; CHECK-NEXT:    buffer_load_dword v55, v2, s[0:3], 0 offen offset:252
-; CHECK-NEXT:    buffer_load_dword v54, v2, s[0:3], 0 offen offset:248
-; CHECK-NEXT:    buffer_load_dword v53, v2, s[0:3], 0 offen offset:244
-; CHECK-NEXT:    buffer_load_dword v52, v2, s[0:3], 0 offen offset:240
-; CHECK-NEXT:    buffer_load_dword v67, v2, s[0:3], 0 offen offset:236
-; CHECK-NEXT:    buffer_load_dword v66, v2, s[0:3], 0 offen offset:232
-; CHECK-NEXT:    buffer_load_dword v65, v2, s[0:3], 0 offen offset:228
-; CHECK-NEXT:    buffer_load_dword v64, v2, s[0:3], 0 offen offset:224
-; CHECK-NEXT:    buffer_load_dword v71, v2, s[0:3], 0 offen offset:204
-; CHECK-NEXT:    buffer_load_dword v70, v2, s[0:3], 0 offen offset:200
-; CHECK-NEXT:    buffer_load_dword v69, v2, s[0:3], 0 offen offset:196
-; CHECK-NEXT:    buffer_load_dword v68, v2, s[0:3], 0 offen offset:192
-; CHECK-NEXT:    buffer_load_dword v83, v2, s[0:3], 0 offen offset:140
-; CHECK-NEXT:    buffer_load_dword v82, v2, s[0:3], 0 offen offset:136
-; CHECK-NEXT:    buffer_load_dword v81, v2, s[0:3], 0 offen offset:132
-; CHECK-NEXT:    buffer_load_dword v80, v2, s[0:3], 0 offen offset:128
-; CHECK-NEXT:    buffer_load_dword v87, v2, s[0:3], 0 offen offset:76
-; CHECK-NEXT:    buffer_load_dword v86, v2, s[0:3], 0 offen offset:72
-; CHECK-NEXT:    buffer_load_dword v85, v2, s[0:3], 0 offen offset:68
-; CHECK-NEXT:    buffer_load_dword v84, v2, s[0:3], 0 offen offset:64
-; CHECK-NEXT:    buffer_load_dword v96, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v97, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v98, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v99, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_dword v37, v2, s[0:3], 0 offen offset:200
+; CHECK-NEXT:    buffer_load_dword v36, v2, s[0:3], 0 offen offset:196
+; CHECK-NEXT:    buffer_load_dword v35, v2, s[0:3], 0 offen offset:192
+; CHECK-NEXT:    buffer_load_dword v55, v2, s[0:3], 0 offen offset:236
+; CHECK-NEXT:    buffer_load_dword v67, v2, s[0:3], 0 offen offset:252
+; CHECK-NEXT:    buffer_load_dword v66, v2, s[0:3], 0 offen offset:248
+; CHECK-NEXT:    buffer_load_dword v65, v2, s[0:3], 0 offen offset:244
+; CHECK-NEXT:    buffer_load_dword v64, v2, s[0:3], 0 offen offset:240
+; CHECK-NEXT:    buffer_load_dword v54, v2, s[0:3], 0 offen offset:232
+; CHECK-NEXT:    buffer_load_dword v53, v2, s[0:3], 0 offen offset:228
+; CHECK-NEXT:    buffer_load_dword v52, v2, s[0:3], 0 offen offset:224
+; CHECK-NEXT:    buffer_load_dword v71, v2, s[0:3], 0 offen offset:140
+; CHECK-NEXT:    buffer_load_dword v83, v2, s[0:3], 0 offen offset:156
+; CHECK-NEXT:    buffer_load_dword v82, v2, s[0:3], 0 offen offset:152
+; CHECK-NEXT:    buffer_load_dword v81, v2, s[0:3], 0 offen offset:148
+; CHECK-NEXT:    buffer_load_dword v80, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT:    buffer_load_dword v70, v2, s[0:3], 0 offen offset:136
+; CHECK-NEXT:    buffer_load_dword v69, v2, s[0:3], 0 offen offset:132
+; CHECK-NEXT:    buffer_load_dword v68, v2, s[0:3], 0 offen offset:128
+; CHECK-NEXT:    buffer_load_dword v84, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v85, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v86, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v96, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v97, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v98, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v99, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v87, v2, s[0:3], 0 offen offset:12
 ; CHECK-NEXT:    v_add_co_u32 v100, vcc_lo, v0, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v101, null, s5, v1, vcc_lo
 ; CHECK-NEXT:    s_add_u32 s4, s4, 0x100
 ; CHECK-NEXT:    v_add_nc_u32_e32 v2, 0x100, v2
 ; CHECK-NEXT:    s_addc_u32 s5, s5, 0
-; CHECK-NEXT:    s_waitcnt vmcnt(20)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[52:55] offset:240
+; CHECK-NEXT:    s_waitcnt vmcnt(19)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[64:67] offset:240
 ; CHECK-NEXT:    s_waitcnt vmcnt(16)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[64:67] offset:224
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[52:55] offset:224
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[48:51] offset:208
-; CHECK-NEXT:    s_waitcnt vmcnt(12)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[68:71] offset:192
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[35:38] offset:192
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[31:34] offset:176
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[35:38] offset:160
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[27:30] offset:144
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[27:30] offset:160
+; CHECK-NEXT:    s_waitcnt vmcnt(11)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[80:83] offset:144
 ; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[80:83] offset:128
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[19:22] offset:112
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[23:26] offset:96
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[68:71] offset:128
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[23:26] offset:112
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[19:22] offset:96
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[15:18] offset:80
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87] offset:64
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[11:14] offset:48
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[7:10] offset:32
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[3:6] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[11:14] offset:64
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[7:10] offset:48
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[3:6] offset:32
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87]
 ; CHECK-NEXT:    s_cmp_lg_u64 s[4:5], 0x800
 ; CHECK-NEXT:    s_cbranch_scc1 .LBB9_1
 ; CHECK-NEXT:  .LBB9_2: ; %Flow10
@@ -12565,103 +12562,101 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; CHECK-NEXT:  .LBB9_4: ; %memmove_bwd_loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3e
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:32
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:36
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:40
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:44
-; CHECK-NEXT:    buffer_load_dword v11, v2, s[0:3], 0 offen offset:48
-; CHECK-NEXT:    buffer_load_dword v12, v2, s[0:3], 0 offen offset:52
-; CHECK-NEXT:    buffer_load_dword v13, v2, s[0:3], 0 offen offset:56
-; CHECK-NEXT:    buffer_load_dword v14, v2, s[0:3], 0 offen offset:60
-; CHECK-NEXT:    buffer_load_dword v18, v2, s[0:3], 0 offen offset:124
-; CHECK-NEXT:    buffer_load_dword v17, v2, s[0:3], 0 offen offset:120
-; CHECK-NEXT:    buffer_load_dword v16, v2, s[0:3], 0 offen offset:116
-; CHECK-NEXT:    buffer_load_dword v15, v2, s[0:3], 0 offen offset:112
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:32
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:36
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:40
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:44
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:48
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:52
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:56
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT:    buffer_load_dword v14, v2, s[0:3], 0 offen offset:76
+; CHECK-NEXT:    buffer_load_dword v18, v2, s[0:3], 0 offen offset:92
+; CHECK-NEXT:    buffer_load_dword v17, v2, s[0:3], 0 offen offset:88
+; CHECK-NEXT:    buffer_load_dword v16, v2, s[0:3], 0 offen offset:84
+; CHECK-NEXT:    buffer_load_dword v15, v2, s[0:3], 0 offen offset:80
+; CHECK-NEXT:    buffer_load_dword v13, v2, s[0:3], 0 offen offset:72
+; CHECK-NEXT:    buffer_load_dword v12, v2, s[0:3], 0 offen offset:68
+; CHECK-NEXT:    buffer_load_dword v11, v2, s[0:3], 0 offen offset:64
 ; CHECK-NEXT:    buffer_load_dword v22, v2, s[0:3], 0 offen offset:108
+; CHECK-NEXT:    buffer_load_dword v26, v2, s[0:3], 0 offen offset:124
+; CHECK-NEXT:    buffer_load_dword v25, v2, s[0:3], 0 offen offset:120
+; CHECK-NEXT:    buffer_load_dword v24, v2, s[0:3], 0 offen offset:116
+; CHECK-NEXT:    buffer_load_dword v23, v2, s[0:3], 0 offen offset:112
 ; CHECK-NEXT:    buffer_load_dword v21, v2, s[0:3], 0 offen offset:104
 ; CHECK-NEXT:    buffer_load_dword v20, v2, s[0:3], 0 offen offset:100
 ; CHECK-NEXT:    buffer_load_dword v19, v2, s[0:3], 0 offen offset:96
-; CHECK-NEXT:    buffer_load_dword v26, v2, s[0:3], 0 offen offset:252
-; CHECK-NEXT:    buffer_load_dword v25, v2, s[0:3], 0 offen offset:248
-; CHECK-NEXT:    buffer_load_dword v24, v2, s[0:3], 0 offen offset:244
-; CHECK-NEXT:    buffer_load_dword v23, v2, s[0:3], 0 offen offset:240
 ; CHECK-NEXT:    buffer_load_dword v30, v2, s[0:3], 0 offen offset:236
+; CHECK-NEXT:    buffer_load_dword v34, v2, s[0:3], 0 offen offset:252
+; CHECK-NEXT:    buffer_load_dword v33, v2, s[0:3], 0 offen offset:248
+; CHECK-NEXT:    buffer_load_dword v32, v2, s[0:3], 0 offen offset:244
+; CHECK-NEXT:    buffer_load_dword v31, v2, s[0:3], 0 offen offset:240
 ; CHECK-NEXT:    buffer_load_dword v29, v2, s[0:3], 0 offen offset:232
 ; CHECK-NEXT:    buffer_load_dword v28, v2, s[0:3], 0 offen offset:228
 ; CHECK-NEXT:    buffer_load_dword v27, v2, s[0:3], 0 offen offset:224
-; CHECK-NEXT:    buffer_load_dword v34, v2, s[0:3], 0 offen offset:220
-; CHECK-NEXT:    buffer_load_dword v33, v2, s[0:3], 0 offen offset:216
-; CHECK-NEXT:    buffer_load_dword v32, v2, s[0:3], 0 offen offset:212
-; CHECK-NEXT:    buffer_load_dword v31, v2, s[0:3], 0 offen offset:208
 ; CHECK-NEXT:    buffer_load_dword v38, v2, s[0:3], 0 offen offset:204
+; CHECK-NEXT:    buffer_load_dword v51, v2, s[0:3], 0 offen offset:220
+; CHECK-NEXT:    buffer_load_dword v50, v2, s[0:3], 0 offen offset:216
+; CHECK-NEXT:    buffer_load_dword v49, v2, s[0:3], 0 offen offset:212
+; CHECK-NEXT:    buffer_load_dword v48, v2, s[0:3], 0 offen offset:208
 ; CHECK-NEXT:    buffer_load_dword v37, v2, s[0:3], 0 offen offset:200
 ; CHECK-NEXT:    buffer_load_dword v36, v2, s[0:3], 0 offen offset:196
 ; CHECK-NEXT:    buffer_load_dword v35, v2, s[0:3], 0 offen offset:192
-; CHECK-NEXT:    buffer_load_dword v51, v2, s[0:3], 0 offen offset:188
-; CHECK-NEXT:    buffer_load_dword v50, v2, s[0:3], 0 offen offset:184
-; CHECK-NEXT:    buffer_load_dword v49, v2, s[0:3], 0 offen offset:180
-; CHECK-NEXT:    buffer_load_dword v48, v2, s[0:3], 0 offen offset:176
 ; CHECK-NEXT:    buffer_load_dword v55, v2, s[0:3], 0 offen offset:172
+; CHECK-NEXT:    buffer_load_dword v67, v2, s[0:3], 0 offen offset:188
+; CHECK-NEXT:    buffer_load_dword v66, v2, s[0:3], 0 offen offset:184
+; CHECK-NEXT:    buffer_load_dword v65, v2, s[0:3], 0 offen offset:180
+; CHECK-NEXT:    buffer_load_dword v64, v2, s[0:3], 0 offen offset:176
 ; CHECK-NEXT:    buffer_load_dword v54, v2, s[0:3], 0 offen offset:168
 ; CHECK-NEXT:    buffer_load_dword v53, v2, s[0:3], 0 offen offset:164
 ; CHECK-NEXT:    buffer_load_dword v52, v2, s[0:3], 0 offen offset:160
-; CHECK-NEXT:    buffer_load_dword v67, v2, s[0:3], 0 offen offset:156
-; CHECK-NEXT:    buffer_load_dword v66, v2, s[0:3], 0 offen offset:152
-; CHECK-NEXT:    buffer_load_dword v65, v2, s[0:3], 0 offen offset:148
-; CHECK-NEXT:    buffer_load_dword v64, v2, s[0:3], 0 offen offset:144
-; CHECK-NEXT:    buffer_load_dword v71, v2, s[0:3], 0 offen offset:140
-; CHECK-NEXT:    buffer_load_dword v70, v2, s[0:3], 0 offen offset:136
-; CHECK-NEXT:    buffer_load_dword v69, v2, s[0:3], 0 offen offset:132
-; CHECK-NEXT:    buffer_load_dword v68, v2, s[0:3], 0 offen offset:128
-; CHECK-NEXT:    buffer_load_dword v83, v2, s[0:3], 0 offen offset:92
-; CHECK-NEXT:    buffer_load_dword v82, v2, s[0:3], 0 offen offset:88
-; CHECK-NEXT:    buffer_load_dword v81, v2, s[0:3], 0 offen offset:84
-; CHECK-NEXT:    buffer_load_dword v80, v2, s[0:3], 0 offen offset:80
-; CHECK-NEXT:    buffer_load_dword v87, v2, s[0:3], 0 offen offset:76
-; CHECK-NEXT:    buffer_load_dword v86, v2, s[0:3], 0 offen offset:72
-; CHECK-NEXT:    buffer_load_dword v85, v2, s[0:3], 0 offen offset:68
-; CHECK-NEXT:    buffer_load_dword v84, v2, s[0:3], 0 offen offset:64
-; CHECK-NEXT:    buffer_load_dword v96, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v97, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v98, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v99, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_dword v71, v2, s[0:3], 0 offen offset:156
+; CHECK-NEXT:    buffer_load_dword v70, v2, s[0:3], 0 offen offset:152
+; CHECK-NEXT:    buffer_load_dword v69, v2, s[0:3], 0 offen offset:148
+; CHECK-NEXT:    buffer_load_dword v68, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT:    buffer_load_dword v83, v2, s[0:3], 0 offen offset:140
+; CHECK-NEXT:    buffer_load_dword v82, v2, s[0:3], 0 offen offset:136
+; CHECK-NEXT:    buffer_load_dword v81, v2, s[0:3], 0 offen offset:132
+; CHECK-NEXT:    buffer_load_dword v80, v2, s[0:3], 0 offen offset:128
+; CHECK-NEXT:    buffer_load_dword v84, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v85, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v86, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v96, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v97, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v98, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v99, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v87, v2, s[0:3], 0 offen offset:12
 ; CHECK-NEXT:    v_add_co_u32 v100, vcc_lo, v0, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v101, null, s5, v1, vcc_lo
 ; CHECK-NEXT:    v_add_nc_u32_e32 v2, 0xffffff00, v2
 ; CHECK-NEXT:    s_add_u32 s4, s4, 0xffffff00
 ; CHECK-NEXT:    s_addc_u32 s5, s5, -1
-; CHECK-NEXT:    s_waitcnt vmcnt(41)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[23:26] offset:240
-; CHECK-NEXT:    s_waitcnt vmcnt(37)
+; CHECK-NEXT:    s_waitcnt vmcnt(35)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[31:34] offset:240
+; CHECK-NEXT:    s_waitcnt vmcnt(32)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[27:30] offset:224
-; CHECK-NEXT:    s_waitcnt vmcnt(33)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[31:34] offset:208
-; CHECK-NEXT:    s_waitcnt vmcnt(29)
+; CHECK-NEXT:    s_waitcnt vmcnt(27)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[48:51] offset:208
+; CHECK-NEXT:    s_waitcnt vmcnt(24)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[35:38] offset:192
-; CHECK-NEXT:    s_waitcnt vmcnt(25)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[48:51] offset:176
-; CHECK-NEXT:    s_waitcnt vmcnt(21)
+; CHECK-NEXT:    s_waitcnt vmcnt(19)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[64:67] offset:176
+; CHECK-NEXT:    s_waitcnt vmcnt(16)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[52:55] offset:160
-; CHECK-NEXT:    s_waitcnt vmcnt(17)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[64:67] offset:144
-; CHECK-NEXT:    s_waitcnt vmcnt(13)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[68:71] offset:128
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[15:18] offset:112
+; CHECK-NEXT:    s_waitcnt vmcnt(12)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[68:71] offset:144
+; CHECK-NEXT:    s_waitcnt vmcnt(8)
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[80:83] offset:128
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[23:26] offset:112
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[19:22] offset:96
-; CHECK-NEXT:    s_waitcnt vmcnt(9)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[80:83] offset:80
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87] offset:64
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[11:14] offset:48
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[7:10] offset:32
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[15:18] offset:80
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[11:14] offset:64
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[7:10] offset:48
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[3:6] offset:32
 ; CHECK-NEXT:    s_waitcnt vmcnt(1)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[3:6] offset:16
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87]
 ; CHECK-NEXT:    s_cmp_eq_u64 s[4:5], s[6:7]
 ; CHECK-NEXT:    s_cbranch_scc0 .LBB9_4
 ; CHECK-NEXT:  .LBB9_5: ; %Flow11
@@ -12736,16 +12731,17 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v1, v2, s[0:3], 0 offen offset:21
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:22
 ; ALIGNED-NEXT:    buffer_load_ubyte v4, v2, s[0:3], 0 offen offset:23
-; ALIGNED-NEXT:    buffer_load_ubyte v7, v2, s[0:3], 0 offen offset:24
+; ALIGNED-NEXT:    buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:24
 ; ALIGNED-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:25
 ; ALIGNED-NEXT:    buffer_load_ubyte v12, v2, s[0:3], 0 offen offset:26
-; ALIGNED-NEXT:    buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:30
-; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:31
+; ALIGNED-NEXT:    buffer_load_ubyte v127, v2, s[0:3], 0 offen offset:19
+; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:28
+; ALIGNED-NEXT:    buffer_load_ubyte v7, v2, s[0:3], 0 offen offset:29
+; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:30
+; ALIGNED-NEXT:    buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:31
 ; ALIGNED-NEXT:    buffer_load_ubyte v14, v2, s[0:3], 0 offen offset:32
 ; ALIGNED-NEXT:    buffer_load_ubyte v15, v2, s[0:3], 0 offen offset:33
 ; ALIGNED-NEXT:    buffer_load_ubyte v17, v2, s[0:3], 0 offen offset:34
-; ALIGNED-NEXT:    buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:29
-; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:28
 ; ALIGNED-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:27
 ; ALIGNED-NEXT:    buffer_load_ubyte v19, v2, s[0:3], 0 offen offset:35
 ; ALIGNED-NEXT:    buffer_load_ubyte v13, v2, s[0:3], 0 offen offset:36
@@ -12768,16 +12764,16 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v36, v2, s[0:3], 0 offen offset:53
 ; ALIGNED-NEXT:    buffer_load_ubyte v33, v2, s[0:3], 0 offen offset:54
 ; ALIGNED-NEXT:    buffer_load_ubyte v35, v2, s[0:3], 0 offen offset:55
-; ALIGNED-NEXT:    buffer_load_ubyte v49, v2, s[0:3], 0 offen offset:56
+; ALIGNED-NEXT:    buffer_load_ubyte v39, v2, s[0:3], 0 offen offset:56
 ; ALIGNED-NEXT:    buffer_load_ubyte v50, v2, s[0:3], 0 offen offset:57
 ; ALIGNED-NEXT:    buffer_load_ubyte v52, v2, s[0:3], 0 offen offset:58
+; ALIGNED-NEXT:    buffer_load_ubyte v37, v2, s[0:3], 0 offen offset:60
+; ALIGNED-NEXT:    buffer_load_ubyte v48, v2, s[0:3], 0 offen offset:61
 ; ALIGNED-NEXT:    buffer_load_ubyte v38, v2, s[0:3], 0 offen offset:62
-; ALIGNED-NEXT:    buffer_load_ubyte v39, v2, s[0:3], 0 offen offset:63
+; ALIGNED-NEXT:    buffer_load_ubyte v49, v2, s[0:3], 0 offen offset:63
 ; ALIGNED-NEXT:    buffer_load_ubyte v53, v2, s[0:3], 0 offen offset:64
 ; ALIGNED-NEXT:    buffer_load_ubyte v54, v2, s[0:3], 0 offen offset:65
 ; ALIGNED-NEXT:    buffer_load_ubyte v65, v2, s[0:3], 0 offen offset:66
-; ALIGNED-NEXT:    buffer_load_ubyte v48, v2, s[0:3], 0 offen offset:61
-; ALIGNED-NEXT:    buffer_load_ubyte v37, v2, s[0:3], 0 offen offset:60
 ; ALIGNED-NEXT:    buffer_load_ubyte v51, v2, s[0:3], 0 offen offset:59
 ; ALIGNED-NEXT:    buffer_load_ubyte v55, v2, s[0:3], 0 offen offset:67
 ; ALIGNED-NEXT:    buffer_load_ubyte v64, v2, s[0:3], 0 offen offset:68
@@ -12786,10 +12782,9 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v68, v2, s[0:3], 0 offen offset:71
 ; ALIGNED-NEXT:    buffer_load_ubyte v69, v2, s[0:3], 0 offen offset:76
 ; ALIGNED-NEXT:    buffer_load_ubyte v70, v2, s[0:3], 0 offen offset:77
+; ALIGNED-NEXT:    buffer_load_ubyte v81, v2, s[0:3], 0 offen offset:75
 ; ALIGNED-NEXT:    buffer_load_ubyte v71, v2, s[0:3], 0 offen offset:78
 ; ALIGNED-NEXT:    buffer_load_ubyte v80, v2, s[0:3], 0 offen offset:79
-; ALIGNED-NEXT:    buffer_load_ubyte v127, v2, s[0:3], 0 offen offset:19
-; ALIGNED-NEXT:    buffer_load_ubyte v81, v2, s[0:3], 0 offen offset:75
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(57)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(56)
@@ -12799,46 +12794,46 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(54)
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(53)
-; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(52)
 ; ALIGNED-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(51)
 ; ALIGNED-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(50)
-; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(49)
-; ALIGNED-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(48)
+; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(47)
+; ALIGNED-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(46)
+; ALIGNED-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(45)
 ; ALIGNED-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v4, 8, v3
-; ALIGNED-NEXT:    s_waitcnt vmcnt(45)
-; ALIGNED-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(44)
-; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(43)
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v7, 8, v5
+; ALIGNED-NEXT:    s_waitcnt vmcnt(42)
 ; ALIGNED-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v9, 8, v5
-; ALIGNED-NEXT:    s_waitcnt vmcnt(41)
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v9, 8, v8
+; ALIGNED-NEXT:    s_waitcnt vmcnt(40)
 ; ALIGNED-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v4, v8, 8, v6
-; ALIGNED-NEXT:    v_lshl_or_b32 v5, v10, 8, v7
+; ALIGNED-NEXT:    v_lshl_or_b32 v5, v10, 8, v6
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v11, 8, v12
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v15, 8, v14
 ; ALIGNED-NEXT:    v_lshl_or_b32 v8, v19, 8, v17
-; ALIGNED-NEXT:    s_waitcnt vmcnt(40)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(39)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v9, v16, 8, v13
-; ALIGNED-NEXT:    s_waitcnt vmcnt(38)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(37)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v10, v20, 8, v18
-; ALIGNED-NEXT:    s_waitcnt vmcnt(36)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(35)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v11, v23, 8, v22
-; ALIGNED-NEXT:    s_waitcnt vmcnt(34)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(33)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v12, v27, 8, v25
-; ALIGNED-NEXT:    s_waitcnt vmcnt(32)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(31)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v13, v24, 8, v21
-; ALIGNED-NEXT:    s_waitcnt vmcnt(30)
-; ALIGNED-NEXT:    v_lshl_or_b32 v14, v28, 8, v26
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
+; ALIGNED-NEXT:    s_waitcnt vmcnt(29)
+; ALIGNED-NEXT:    v_lshl_or_b32 v14, v28, 8, v26
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v6, 16, v5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v8, 16, v7
@@ -12846,26 +12841,27 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v12, 16, v11
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v14, 16, v13
 ; ALIGNED-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(28)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(27)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v15, v30, 8, v29
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:824 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(26)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(25)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v32, 8, v34
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:832 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(24)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(23)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v36, 8, v31
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(22)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(21)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v35, 8, v33
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(12)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(16)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v48, 8, v37
-; ALIGNED-NEXT:    v_lshl_or_b32 v5, v39, 8, v38
+; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(14)
+; ALIGNED-NEXT:    v_lshl_or_b32 v5, v49, 8, v38
 ; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:876 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v6, v50, 8, v49
+; ALIGNED-NEXT:    v_lshl_or_b32 v6, v50, 8, v39
 ; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(11)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(10)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v51, 8, v52
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v0, 16, v15
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v3, 16, v1
@@ -12875,13 +12871,13 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v54, 8, v53
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:924 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(11)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(10)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v55, 8, v65
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(9)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v66, 8, v64
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:948 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v68, 8, v67
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_clause 0x1
@@ -12890,13 +12886,13 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:784 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v4, 16, v3
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v70, 8, v69
 ; ALIGNED-NEXT:    s_clause 0x1
 ; ALIGNED-NEXT:    buffer_load_ubyte v4, v2, s[0:3], 0 offen offset:83
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:74
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:984 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(5)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v80, 8, v71
 ; ALIGNED-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
@@ -12922,11 +12918,11 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v36, off, s[0:3], s32 offset:872 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v33, off, s[0:3], s32 offset:860 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v35, off, s[0:3], s32 offset:864 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    buffer_store_dword v38, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    buffer_store_dword v39, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v37, off, s[0:3], s32 offset:884 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    buffer_store_dword v49, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v48, off, s[0:3], s32 offset:896 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v38, off, s[0:3], s32 offset:888 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v49, off, s[0:3], s32 offset:900 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v39, off, s[0:3], s32 offset:892 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v50, off, s[0:3], s32 offset:904 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v52, off, s[0:3], s32 offset:912 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v51, off, s[0:3], s32 offset:908 ; 4-byte Folded Spill
@@ -12942,9 +12938,8 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v70, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v71, off, s[0:3], s32 offset:976 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v80, off, s[0:3], s32 offset:980 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
 ; ALIGNED-NEXT:    buffer_store_dword v81, off, s[0:3], s32 offset:1000 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    buffer_store_dword v127, off, s[0:3], s32 offset:1404 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v127, off, s[0:3], s32 offset:1400 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:87
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
 ; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:1032 ; 4-byte Folded Spill
@@ -13227,39 +13222,39 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v120, 8, v111
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:1400 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:1404 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
 ; ALIGNED-NEXT:    buffer_load_ubyte v121, v2, s[0:3], 0 offen offset:149
 ; ALIGNED-NEXT:    buffer_load_ubyte v122, v2, s[0:3], 0 offen offset:150
-; ALIGNED-NEXT:    buffer_load_ubyte v109, v2, s[0:3], 0 offen offset:151
+; ALIGNED-NEXT:    buffer_load_ubyte v110, v2, s[0:3], 0 offen offset:151
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1408 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v121, 8, v3
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v109, 8, v122
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v110, 8, v122
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1412 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
 ; ALIGNED-NEXT:    buffer_load_ubyte v108, v2, s[0:3], 0 offen offset:156
 ; ALIGNED-NEXT:    buffer_load_ubyte v105, v2, s[0:3], 0 offen offset:157
-; ALIGNED-NEXT:    buffer_load_ubyte v107, v2, s[0:3], 0 offen offset:158
+; ALIGNED-NEXT:    buffer_load_ubyte v106, v2, s[0:3], 0 offen offset:158
 ; ALIGNED-NEXT:    buffer_load_ubyte v104, v2, s[0:3], 0 offen offset:159
-; ALIGNED-NEXT:    buffer_load_ubyte v95, v2, s[0:3], 0 offen offset:155
+; ALIGNED-NEXT:    buffer_load_ubyte v94, v2, s[0:3], 0 offen offset:155
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v105, 8, v108
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v104, 8, v107
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v104, 8, v106
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1416 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
-; ALIGNED-NEXT:    buffer_load_ubyte v93, v2, s[0:3], 0 offen offset:152
+; ALIGNED-NEXT:    buffer_load_ubyte v95, v2, s[0:3], 0 offen offset:152
 ; ALIGNED-NEXT:    buffer_load_ubyte v92, v2, s[0:3], 0 offen offset:153
 ; ALIGNED-NEXT:    buffer_load_ubyte v90, v2, s[0:3], 0 offen offset:154
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v92, 8, v93
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v92, 8, v95
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v95, 8, v90
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v94, 8, v90
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1420 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x7
@@ -13268,8 +13263,8 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v73, v2, s[0:3], 0 offen offset:162
 ; ALIGNED-NEXT:    buffer_load_ubyte v74, v2, s[0:3], 0 offen offset:163
 ; ALIGNED-NEXT:    buffer_load_ubyte v88, v2, s[0:3], 0 offen offset:164
-; ALIGNED-NEXT:    buffer_load_ubyte v75, v2, s[0:3], 0 offen offset:165
-; ALIGNED-NEXT:    buffer_load_ubyte v76, v2, s[0:3], 0 offen offset:166
+; ALIGNED-NEXT:    buffer_load_ubyte v76, v2, s[0:3], 0 offen offset:165
+; ALIGNED-NEXT:    buffer_load_ubyte v75, v2, s[0:3], 0 offen offset:166
 ; ALIGNED-NEXT:    buffer_load_ubyte v72, v2, s[0:3], 0 offen offset:167
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v79, 8, v89
@@ -13277,9 +13272,9 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v74, 8, v73
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v72, 8, v76
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v72, 8, v75
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1424 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v75, 8, v88
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v76, 8, v88
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1428 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
@@ -13356,15 +13351,16 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v98, v2, s[0:3], 0 offen offset:198
 ; ALIGNED-NEXT:    buffer_load_ubyte v87, v2, s[0:3], 0 offen offset:199
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v100, 8, v102
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v100, 8, v102
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v96, 8, v97
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v3, 16, v0
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v96, 8, v97
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v4, 16, v3
+; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v99, 8, v101
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v87, 8, v98
+; ALIGNED-NEXT:    v_lshl_or_b32 v4, v87, 8, v98
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1456 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v99, 8, v101
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v3, 16, v0
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v4, 16, v3
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1460 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
 ; ALIGNED-NEXT:    buffer_load_ubyte v85, v2, s[0:3], 0 offen offset:204
@@ -13380,10 +13376,10 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1464 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
 ; ALIGNED-NEXT:    buffer_load_ubyte v80, v2, s[0:3], 0 offen offset:200
-; ALIGNED-NEXT:    buffer_load_ubyte v70, v2, s[0:3], 0 offen offset:201
+; ALIGNED-NEXT:    buffer_load_ubyte v71, v2, s[0:3], 0 offen offset:201
 ; ALIGNED-NEXT:    buffer_load_ubyte v69, v2, s[0:3], 0 offen offset:202
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v70, 8, v80
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v71, 8, v80
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v81, 8, v69
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v4, 16, v3
@@ -13475,11 +13471,11 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:248
 ; ALIGNED-NEXT:    buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:249
 ; ALIGNED-NEXT:    buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:250
-; ALIGNED-NEXT:    v_lshl_or_b32 v110, v4, 16, v3
+; ALIGNED-NEXT:    v_lshl_or_b32 v109, v4, 16, v3
 ; ALIGNED-NEXT:    s_clause 0x4
 ; ALIGNED-NEXT:    buffer_load_ubyte v1, v2, s[0:3], 0 offen
 ; ALIGNED-NEXT:    buffer_load_ubyte v0, v2, s[0:3], 0 offen offset:3
-; ALIGNED-NEXT:    buffer_load_ubyte v106, v2, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT:    buffer_load_ubyte v107, v2, s[0:3], 0 offen offset:4
 ; ALIGNED-NEXT:    buffer_load_ubyte v123, v2, s[0:3], 0 offen offset:5
 ; ALIGNED-NEXT:    buffer_load_ubyte v125, v2, s[0:3], 0 offen offset:6
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(27)
@@ -13490,26 +13486,26 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v77, v13, 8, v16
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(9)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v91, v9, 8, v10
-; ALIGNED-NEXT:    v_lshl_or_b32 v94, v4, 16, v3
+; ALIGNED-NEXT:    v_lshl_or_b32 v93, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v22, 8, v24
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v23, 8, v21
 ; ALIGNED-NEXT:    v_lshl_or_b32 v78, v4, 16, v3
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v18, 8, v20
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v14, 8, v15
 ; ALIGNED-NEXT:    v_lshl_or_b32 v103, v4, 16, v3
-; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:7
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v17, 8, v19
+; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:7
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1292 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
-; ALIGNED-NEXT:    buffer_store_dword v106, off, s[0:3], s32 offset:1296 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v107, off, s[0:3], s32 offset:1296 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
 ; ALIGNED-NEXT:    buffer_store_dword v123, off, s[0:3], s32 offset:1304 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
-; ALIGNED-NEXT:    buffer_store_dword v125, off, s[0:3], s32 offset:1308 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v86, v77, 16, v4
 ; ALIGNED-NEXT:    v_lshl_or_b32 v77, v11, 8, v12
-; ALIGNED-NEXT:    v_lshl_or_b32 v71, v91, 16, v77
+; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
+; ALIGNED-NEXT:    buffer_store_dword v125, off, s[0:3], s32 offset:1308 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    v_lshl_or_b32 v70, v91, 16, v77
 ; ALIGNED-NEXT:    v_lshl_or_b32 v77, v6, 8, v8
 ; ALIGNED-NEXT:    v_lshl_or_b32 v91, v7, 8, v5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v4, v91, 16, v77
@@ -13527,7 +13523,7 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v91, v0, 8, v91
 ; ALIGNED-NEXT:    buffer_load_ubyte v1, v2, s[0:3], 0 offen offset:12
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v91, 16, v77
-; ALIGNED-NEXT:    v_lshl_or_b32 v77, v123, 8, v106
+; ALIGNED-NEXT:    v_lshl_or_b32 v77, v123, 8, v107
 ; ALIGNED-NEXT:    v_lshl_or_b32 v91, v3, 8, v125
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:13
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1316 ; 4-byte Folded Spill
@@ -13560,21 +13556,21 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v91, 16, v77
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1392 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
-; ALIGNED-NEXT:    buffer_load_ubyte v106, v2, s[0:3], 0 offen offset:16
-; ALIGNED-NEXT:    buffer_load_ubyte v77, v2, s[0:3], 0 offen offset:18
-; ALIGNED-NEXT:    buffer_load_ubyte v91, v2, s[0:3], 0 offen offset:17
+; ALIGNED-NEXT:    buffer_load_ubyte v91, v2, s[0:3], 0 offen offset:18
+; ALIGNED-NEXT:    buffer_load_ubyte v107, v2, s[0:3], 0 offen offset:16
+; ALIGNED-NEXT:    buffer_load_ubyte v77, v2, s[0:3], 0 offen offset:17
 ; ALIGNED-NEXT:    buffer_store_dword v4, off, s[0:3], s32 offset:232
-; ALIGNED-NEXT:    buffer_store_dword v71, off, s[0:3], s32 offset:236
+; ALIGNED-NEXT:    buffer_store_dword v70, off, s[0:3], s32 offset:236
 ; ALIGNED-NEXT:    buffer_store_dword v86, off, s[0:3], s32 offset:228
 ; ALIGNED-NEXT:    buffer_store_dword v103, off, s[0:3], s32 offset:224
 ; ALIGNED-NEXT:    s_clause 0x1
 ; ALIGNED-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:704
 ; ALIGNED-NEXT:    buffer_load_dword v4, off, s[0:3], s32 offset:708
 ; ALIGNED-NEXT:    v_add_nc_u32_e32 v2, 0x100, v2
-; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v127, 8, v77
+; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v127, 8, v91
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
-; ALIGNED-NEXT:    v_lshl_or_b32 v127, v91, 8, v106
+; ALIGNED-NEXT:    v_lshl_or_b32 v127, v77, 8, v107
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
 ; ALIGNED-NEXT:    v_add_co_u32 v3, vcc_lo, v3, s4
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
@@ -13596,8 +13592,8 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v19 offset:244
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v20 offset:240
 ; ALIGNED-NEXT:    buffer_store_dword v78, off, s[0:3], s32 offset:248
-; ALIGNED-NEXT:    buffer_store_dword v94, off, s[0:3], s32 offset:252
-; ALIGNED-NEXT:    buffer_store_dword v110, off, s[0:3], s32 offset:244
+; ALIGNED-NEXT:    buffer_store_dword v93, off, s[0:3], s32 offset:252
+; ALIGNED-NEXT:    buffer_store_dword v109, off, s[0:3], s32 offset:244
 ; ALIGNED-NEXT:    v_lshl_or_b32 v127, v0, 16, v127
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1488 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_add_u32 s4, s4, 0x100
@@ -13663,7 +13659,7 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:208
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v69 offset:202
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v81 offset:203
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v70 offset:201
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v71 offset:201
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v82 offset:207
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v83 offset:205
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v84 offset:206
@@ -13729,8 +13725,8 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v74 offset:163
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v79 offset:161
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v72 offset:167
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v75 offset:165
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v76 offset:166
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v76 offset:165
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v75 offset:166
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v88 offset:164
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v89 offset:160
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1420 ; 4-byte Folded Reload
@@ -13746,20 +13742,20 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:256
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v90 offset:154
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v95 offset:155
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v94 offset:155
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v92 offset:153
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v104 offset:159
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v105 offset:157
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v107 offset:158
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v106 offset:158
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v108 offset:156
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v93 offset:152
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v95 offset:152
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v111 offset:146
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v120 offset:147
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v124 offset:145
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v109 offset:151
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v110 offset:151
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v121 offset:149
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v122 offset:150
-; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1400 ; 4-byte Folded Reload
+; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1404 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:148
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1396 ; 4-byte Folded Reload
@@ -14219,11 +14215,11 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:24
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v77 offset:18
-; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1404 ; 4-byte Folded Reload
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v91 offset:18
+; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1400 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:19
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v91 offset:17
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v77 offset:17
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:23
@@ -14236,7 +14232,7 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0 offset:20
-; ALIGNED-NEXT:    flat_store_byte v[3:4], v106 offset:16
+; ALIGNED-NEXT:    flat_store_byte v[3:4], v107 offset:16
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1392 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:408
@@ -14305,16 +14301,17 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v1, v4, s[0:3], 0 offen offset:21
 ; ALIGNED-NEXT:    buffer_load_ubyte v2, v4, s[0:3], 0 offen offset:22
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v4, s[0:3], 0 offen offset:23
-; ALIGNED-NEXT:    buffer_load_ubyte v7, v4, s[0:3], 0 offen offset:24
+; ALIGNED-NEXT:    buffer_load_ubyte v6, v4, s[0:3], 0 offen offset:24
 ; ALIGNED-NEXT:    buffer_load_ubyte v10, v4, s[0:3], 0 offen offset:25
 ; ALIGNED-NEXT:    buffer_load_ubyte v12, v4, s[0:3], 0 offen offset:26
-; ALIGNED-NEXT:    buffer_load_ubyte v6, v4, s[0:3], 0 offen offset:30
-; ALIGNED-NEXT:    buffer_load_ubyte v8, v4, s[0:3], 0 offen offset:31
+; ALIGNED-NEXT:    buffer_load_ubyte v126, v4, s[0:3], 0 offen offset:19
+; ALIGNED-NEXT:    buffer_load_ubyte v5, v4, s[0:3], 0 offen offset:28
+; ALIGNED-NEXT:    buffer_load_ubyte v7, v4, s[0:3], 0 offen offset:29
+; ALIGNED-NEXT:    buffer_load_ubyte v8, v4, s[0:3], 0 offen offset:30
+; ALIGNED-NEXT:    buffer_load_ubyte v9, v4, s[0:3], 0 offen offset:31
 ; ALIGNED-NEXT:    buffer_load_ubyte v14, v4, s[0:3], 0 offen offset:32
 ; ALIGNED-NEXT:    buffer_load_ubyte v15, v4, s[0:3], 0 offen offset:33
 ; ALIGNED-NEXT:    buffer_load_ubyte v17, v4, s[0:3], 0 offen offset:34
-; ALIGNED-NEXT:    buffer_load_ubyte v9, v4, s[0:3], 0 offen offset:29
-; ALIGNED-NEXT:    buffer_load_ubyte v5, v4, s[0:3], 0 offen offset:28
 ; ALIGNED-NEXT:    buffer_load_ubyte v11, v4, s[0:3], 0 offen offset:27
 ; ALIGNED-NEXT:    buffer_load_ubyte v19, v4, s[0:3], 0 offen offset:35
 ; ALIGNED-NEXT:    buffer_load_ubyte v13, v4, s[0:3], 0 offen offset:36
@@ -14355,10 +14352,9 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_ubyte v68, v4, s[0:3], 0 offen offset:71
 ; ALIGNED-NEXT:    buffer_load_ubyte v69, v4, s[0:3], 0 offen offset:76
 ; ALIGNED-NEXT:    buffer_load_ubyte v70, v4, s[0:3], 0 offen offset:77
+; ALIGNED-NEXT:    buffer_load_ubyte v81, v4, s[0:3], 0 offen offset:75
 ; ALIGNED-NEXT:    buffer_load_ubyte v71, v4, s[0:3], 0 offen offset:78
 ; ALIGNED-NEXT:    buffer_load_ubyte v80, v4, s[0:3], 0 offen offset:79
-; ALIGNED-NEXT:    buffer_load_ubyte v126, v4, s[0:3], 0 offen offset:19
-; ALIGNED-NEXT:    buffer_load_ubyte v81, v4, s[0:3], 0 offen offset:75
 ; ALIGNED-NEXT:    buffer_load_ubyte v125, v4, s[0:3], 0 offen offset:151
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(58)
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:712 ; 4-byte Folded Spill
@@ -14369,46 +14365,46 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(55)
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:724 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(54)
-; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:732 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(53)
 ; ALIGNED-NEXT:    buffer_store_dword v10, off, s[0:3], s32 offset:748 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(52)
 ; ALIGNED-NEXT:    buffer_store_dword v12, off, s[0:3], s32 offset:756 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(51)
-; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(50)
-; ALIGNED-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(49)
+; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(48)
+; ALIGNED-NEXT:    buffer_store_dword v8, off, s[0:3], s32 offset:740 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(47)
+; ALIGNED-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:744 ; 4-byte Folded Spill
+; ALIGNED-NEXT:    s_waitcnt vmcnt(46)
 ; ALIGNED-NEXT:    buffer_store_dword v14, off, s[0:3], s32 offset:768 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 8, v0
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v3, 8, v2
-; ALIGNED-NEXT:    s_waitcnt vmcnt(46)
-; ALIGNED-NEXT:    buffer_store_dword v9, off, s[0:3], s32 offset:736 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(45)
-; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:728 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(44)
+; ALIGNED-NEXT:    v_lshl_or_b32 v2, v7, 8, v5
+; ALIGNED-NEXT:    s_waitcnt vmcnt(43)
 ; ALIGNED-NEXT:    buffer_store_dword v11, off, s[0:3], s32 offset:752 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v2, v9, 8, v5
-; ALIGNED-NEXT:    s_waitcnt vmcnt(42)
+; ALIGNED-NEXT:    v_lshl_or_b32 v3, v9, 8, v8
+; ALIGNED-NEXT:    s_waitcnt vmcnt(41)
 ; ALIGNED-NEXT:    buffer_store_dword v13, off, s[0:3], s32 offset:760 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v3, v8, 8, v6
-; ALIGNED-NEXT:    v_lshl_or_b32 v5, v10, 8, v7
+; ALIGNED-NEXT:    v_lshl_or_b32 v5, v10, 8, v6
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v11, 8, v12
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v15, 8, v14
 ; ALIGNED-NEXT:    v_lshl_or_b32 v8, v19, 8, v17
-; ALIGNED-NEXT:    s_waitcnt vmcnt(41)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(40)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v9, v16, 8, v13
-; ALIGNED-NEXT:    s_waitcnt vmcnt(39)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(38)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v10, v20, 8, v18
-; ALIGNED-NEXT:    s_waitcnt vmcnt(37)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(36)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v11, v23, 8, v22
-; ALIGNED-NEXT:    s_waitcnt vmcnt(35)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(34)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v12, v28, 8, v25
-; ALIGNED-NEXT:    s_waitcnt vmcnt(33)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(32)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v13, v24, 8, v21
-; ALIGNED-NEXT:    s_waitcnt vmcnt(31)
-; ALIGNED-NEXT:    v_lshl_or_b32 v14, v27, 8, v26
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
+; ALIGNED-NEXT:    s_waitcnt vmcnt(30)
+; ALIGNED-NEXT:    v_lshl_or_b32 v14, v27, 8, v26
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v3, 16, v2
 ; ALIGNED-NEXT:    v_lshl_or_b32 v2, v6, 16, v5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v8, 16, v7
@@ -14416,27 +14412,27 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v12, 16, v11
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v14, 16, v13
 ; ALIGNED-NEXT:    buffer_store_dword v15, off, s[0:3], s32 offset:772 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(29)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(28)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v15, v31, 8, v30
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:780 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(27)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(26)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v34, 8, v33
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:796 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(25)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(24)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v37, 8, v32
 ; ALIGNED-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:800 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(23)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(22)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v2, v36, 8, v35
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:840 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(18)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(17)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v50, 8, v38
 ; ALIGNED-NEXT:    buffer_store_dword v5, off, s[0:3], s32 offset:852 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(16)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(15)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v5, v49, 8, v39
 ; ALIGNED-NEXT:    buffer_store_dword v6, off, s[0:3], s32 offset:868 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v6, v51, 8, v48
 ; ALIGNED-NEXT:    buffer_store_dword v7, off, s[0:3], s32 offset:880 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(12)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(11)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v7, v53, 8, v52
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v0, 16, v15
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v2, 16, v1
@@ -14446,13 +14442,13 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:916 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v55, 8, v29
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:920 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(12)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(11)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v67, 8, v66
 ; ALIGNED-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:928 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(10)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(9)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v2, v64, 8, v54
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:932 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v68, 8, v65
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_clause 0x1
@@ -14461,13 +14457,13 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v17, off, s[0:3], s32 offset:788 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v3, 16, v2
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:976 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v70, 8, v69
 ; ALIGNED-NEXT:    s_clause 0x1
 ; ALIGNED-NEXT:    buffer_load_ubyte v3, v4, s[0:3], 0 offen offset:83
 ; ALIGNED-NEXT:    buffer_load_ubyte v2, v4, s[0:3], 0 offen offset:74
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:988 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
+; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v1, v80, 8, v71
 ; ALIGNED-NEXT:    buffer_store_dword v19, off, s[0:3], s32 offset:792 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v16, off, s[0:3], s32 offset:764 ; 4-byte Folded Spill
@@ -14513,9 +14509,7 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_store_dword v70, off, s[0:3], s32 offset:968 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v71, off, s[0:3], s32 offset:972 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v80, off, s[0:3], s32 offset:980 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(9)
 ; ALIGNED-NEXT:    buffer_store_dword v126, off, s[0:3], s32 offset:1416 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    s_waitcnt vmcnt(8)
 ; ALIGNED-NEXT:    buffer_store_dword v81, off, s[0:3], s32 offset:1000 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_load_ubyte v8, v4, s[0:3], 0 offen offset:87
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(7)
@@ -14840,21 +14834,21 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    s_clause 0x7
 ; ALIGNED-NEXT:    buffer_load_ubyte v89, v4, s[0:3], 0 offen offset:160
 ; ALIGNED-NEXT:    buffer_load_ubyte v79, v4, s[0:3], 0 offen offset:161
-; ALIGNED-NEXT:    buffer_load_ubyte v73, v4, s[0:3], 0 offen offset:162
+; ALIGNED-NEXT:    buffer_load_ubyte v75, v4, s[0:3], 0 offen offset:162
 ; ALIGNED-NEXT:    buffer_load_ubyte v74, v4, s[0:3], 0 offen offset:163
 ; ALIGNED-NEXT:    buffer_load_ubyte v88, v4, s[0:3], 0 offen offset:164
-; ALIGNED-NEXT:    buffer_load_ubyte v75, v4, s[0:3], 0 offen offset:165
-; ALIGNED-NEXT:    buffer_load_ubyte v77, v4, s[0:3], 0 offen offset:166
+; ALIGNED-NEXT:    buffer_load_ubyte v77, v4, s[0:3], 0 offen offset:165
+; ALIGNED-NEXT:    buffer_load_ubyte v76, v4, s[0:3], 0 offen offset:166
 ; ALIGNED-NEXT:    buffer_load_ubyte v72, v4, s[0:3], 0 offen offset:167
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(6)
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v79, 8, v89
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v74, 8, v73
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v74, 8, v75
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
-; ALIGNED-NEXT:    v_lshl_or_b32 v1, v72, 8, v77
+; ALIGNED-NEXT:    v_lshl_or_b32 v1, v72, 8, v76
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1420 ; 4-byte Folded Spill
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v75, 8, v88
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v77, 8, v88
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1424 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x4
@@ -15064,7 +15058,7 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v104, v3, 16, v2
 ; ALIGNED-NEXT:    v_lshl_or_b32 v2, v21, 8, v22
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v23, 8, v20
-; ALIGNED-NEXT:    v_lshl_or_b32 v76, v3, 16, v2
+; ALIGNED-NEXT:    v_lshl_or_b32 v73, v3, 16, v2
 ; ALIGNED-NEXT:    v_lshl_or_b32 v2, v17, 8, v19
 ; ALIGNED-NEXT:    v_lshl_or_b32 v3, v14, 8, v13
 ; ALIGNED-NEXT:    v_lshl_or_b32 v101, v3, 16, v2
@@ -15129,9 +15123,9 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    v_lshl_or_b32 v0, v57, 16, v43
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1412 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    s_clause 0x2
+; ALIGNED-NEXT:    buffer_load_ubyte v57, v4, s[0:3], 0 offen offset:18
 ; ALIGNED-NEXT:    buffer_load_ubyte v78, v4, s[0:3], 0 offen offset:16
-; ALIGNED-NEXT:    buffer_load_ubyte v43, v4, s[0:3], 0 offen offset:18
-; ALIGNED-NEXT:    buffer_load_ubyte v57, v4, s[0:3], 0 offen offset:17
+; ALIGNED-NEXT:    buffer_load_ubyte v43, v4, s[0:3], 0 offen offset:17
 ; ALIGNED-NEXT:    buffer_store_dword v2, off, s[0:3], s32 offset:488
 ; ALIGNED-NEXT:    buffer_store_dword v3, off, s[0:3], s32 offset:492
 ; ALIGNED-NEXT:    buffer_store_dword v84, off, s[0:3], s32 offset:484
@@ -15140,10 +15134,10 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_dword v2, off, s[0:3], s32 offset:704
 ; ALIGNED-NEXT:    buffer_load_dword v3, off, s[0:3], s32 offset:708
 ; ALIGNED-NEXT:    v_add_nc_u32_e32 v4, 0xffffff00, v4
-; ALIGNED-NEXT:    s_waitcnt vmcnt(3)
-; ALIGNED-NEXT:    v_lshl_or_b32 v0, v126, 8, v43
+; ALIGNED-NEXT:    s_waitcnt vmcnt(4)
+; ALIGNED-NEXT:    v_lshl_or_b32 v0, v126, 8, v57
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(2)
-; ALIGNED-NEXT:    v_lshl_or_b32 v126, v57, 8, v78
+; ALIGNED-NEXT:    v_lshl_or_b32 v126, v43, 8, v78
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(1)
 ; ALIGNED-NEXT:    v_add_co_u32 v2, vcc_lo, v2, s4
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
@@ -15164,7 +15158,7 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v16 offset:246
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v18 offset:244
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v19 offset:240
-; ALIGNED-NEXT:    buffer_store_dword v76, off, s[0:3], s32 offset:504
+; ALIGNED-NEXT:    buffer_store_dword v73, off, s[0:3], s32 offset:504
 ; ALIGNED-NEXT:    buffer_store_dword v104, off, s[0:3], s32 offset:508
 ; ALIGNED-NEXT:    buffer_store_dword v123, off, s[0:3], s32 offset:500
 ; ALIGNED-NEXT:    v_lshl_or_b32 v126, v0, 16, v126
@@ -15294,12 +15288,12 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v62 offset:174
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v63 offset:172
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v59 offset:168
-; ALIGNED-NEXT:    flat_store_byte v[2:3], v73 offset:162
+; ALIGNED-NEXT:    flat_store_byte v[2:3], v75 offset:162
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v74 offset:163
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v79 offset:161
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v72 offset:167
-; ALIGNED-NEXT:    flat_store_byte v[2:3], v75 offset:165
-; ALIGNED-NEXT:    flat_store_byte v[2:3], v77 offset:166
+; ALIGNED-NEXT:    flat_store_byte v[2:3], v77 offset:165
+; ALIGNED-NEXT:    flat_store_byte v[2:3], v76 offset:166
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v88 offset:164
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v89 offset:160
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1408 ; 4-byte Folded Reload
@@ -15798,11 +15792,11 @@ define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:732 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v0 offset:24
-; ALIGNED-NEXT:    flat_store_byte v[2:3], v43 offset:18
+; ALIGNED-NEXT:    flat_store_byte v[2:3], v57 offset:18
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:1416 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v0 offset:19
-; ALIGNED-NEXT:    flat_store_byte v[2:3], v57 offset:17
+; ALIGNED-NEXT:    flat_store_byte v[2:3], v43 offset:17
 ; ALIGNED-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:724 ; 4-byte Folded Reload
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[2:3], v0 offset:23
diff --git a/llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll b/llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll
index 4e5688adcd6bb..f08ea27040fb5 100644
--- a/llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll
+++ b/llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll
@@ -485,12 +485,12 @@ define void @memmove_p0_p3_sz31_align_1_1(ptr addrspace(0) align 1 %dst, ptr add
 ; CHECK-LABEL: memmove_p0_p3_sz31_align_1_1:
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_b32 v8, v2 offset:24
+; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_u16 v10, v2 offset:28
 ; CHECK-NEXT:    ds_read_b64 v[6:7], v2 offset:16
 ; CHECK-NEXT:    ds_read2_b64 v[2:5], v2 offset1:1
-; CHECK-NEXT:    s_waitcnt lgkmcnt(4)
+; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_byte v[0:1], v9 offset:30
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
@@ -540,12 +540,12 @@ define void @memmove_p0_p3_sz31_align_2_2(ptr addrspace(0) align 2 %dst, ptr add
 ; CHECK-LABEL: memmove_p0_p3_sz31_align_2_2:
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_b32 v8, v2 offset:24
+; CHECK-NEXT:    ds_read_u8 v9, v2 offset:30
 ; CHECK-NEXT:    ds_read_u16 v10, v2 offset:28
 ; CHECK-NEXT:    ds_read_b64 v[6:7], v2 offset:16
 ; CHECK-NEXT:    ds_read2_b64 v[2:5], v2 offset1:1
-; CHECK-NEXT:    s_waitcnt lgkmcnt(4)
+; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_byte v[0:1], v9 offset:30
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(3)
 ; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
@@ -939,19 +939,18 @@ define void @memmove_p0_p5_sz31_align_1_1(ptr addrspace(0) align 1 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_byte v[0:1], v11 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    flat_store_short v[0:1], v11 offset:28
+; CHECK-NEXT:    flat_store_byte v[0:1], v10 offset:30
 ; CHECK-NEXT:    flat_store_dwordx3 v[0:1], v[7:9] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
@@ -967,18 +966,18 @@ define void @memmove_p0_p5_sz32_align_1_1(ptr addrspace(0) align 1 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x7
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6] offset:16
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10]
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -1009,19 +1008,18 @@ define void @memmove_p0_p5_sz31_align_2_2(ptr addrspace(0) align 2 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_byte v[0:1], v11 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    flat_store_short v[0:1], v11 offset:28
+; CHECK-NEXT:    flat_store_byte v[0:1], v10 offset:30
 ; CHECK-NEXT:    flat_store_dwordx3 v[0:1], v[7:9] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
@@ -1037,18 +1035,18 @@ define void @memmove_p0_p5_sz32_align_2_2(ptr addrspace(0) align 2 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x7
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:4
-; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:8
-; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6] offset:16
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_dword v10, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10] offset:16
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[7:10]
+; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -1079,20 +1077,20 @@ define void @memmove_p0_p5_sz31_align_8_8(ptr addrspace(0) align 8 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(6)
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
 ; CHECK-NEXT:    flat_store_dwordx3 v[0:1], v[7:9] offset:16
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_byte v[0:1], v11 offset:30
-; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
+; CHECK-NEXT:    flat_store_byte v[0:1], v10 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_short v[0:1], v11 offset:28
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
@@ -1149,20 +1147,20 @@ define void @memmove_p0_p5_sz31_align_16_16(ptr addrspace(0) align 16 %dst, ptr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
-; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(6)
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
 ; CHECK-NEXT:    flat_store_dwordx3 v[0:1], v[7:9] offset:16
-; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    flat_store_byte v[0:1], v11 offset:30
-; CHECK-NEXT:    flat_store_short v[0:1], v10 offset:28
+; CHECK-NEXT:    flat_store_byte v[0:1], v10 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
+; CHECK-NEXT:    flat_store_short v[0:1], v11 offset:28
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[0:1], v[3:6]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
@@ -2079,20 +2077,18 @@ define void @memmove_p1_p5_sz31_align_1_1(ptr addrspace(1) align 1 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    global_store_short v[0:1], v10, off offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(7)
-; CHECK-NEXT:    global_store_byte v[0:1], v11, off offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    global_store_short v[0:1], v11, off offset:28
+; CHECK-NEXT:    global_store_byte v[0:1], v10, off offset:30
 ; CHECK-NEXT:    global_store_dwordx4 v[0:1], v[3:6], off
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx3 v[0:1], v[7:9], off offset:16
@@ -2147,20 +2143,18 @@ define void @memmove_p1_p5_sz31_align_2_2(ptr addrspace(1) align 2 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    global_store_short v[0:1], v10, off offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(7)
-; CHECK-NEXT:    global_store_byte v[0:1], v11, off offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    global_store_short v[0:1], v11, off offset:28
+; CHECK-NEXT:    global_store_byte v[0:1], v10, off offset:30
 ; CHECK-NEXT:    global_store_dwordx4 v[0:1], v[3:6], off
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx3 v[0:1], v[7:9], off offset:16
@@ -2215,20 +2209,18 @@ define void @memmove_p1_p5_sz31_align_8_8(ptr addrspace(1) align 8 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    global_store_short v[0:1], v10, off offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(7)
-; CHECK-NEXT:    global_store_byte v[0:1], v11, off offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    global_store_short v[0:1], v11, off offset:28
+; CHECK-NEXT:    global_store_byte v[0:1], v10, off offset:30
 ; CHECK-NEXT:    global_store_dwordx4 v[0:1], v[3:6], off
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx3 v[0:1], v[7:9], off offset:16
@@ -2283,20 +2275,18 @@ define void @memmove_p1_p5_sz31_align_16_16(ptr addrspace(1) align 16 %dst, ptr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_ushort v10, v2, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v5, v2, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v6, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT:    buffer_load_ushort v11, v2, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v7, v2, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v8, v2, s[0:3], 0 offen offset:20
 ; CHECK-NEXT:    buffer_load_dword v9, v2, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    global_store_short v[0:1], v10, off offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(7)
-; CHECK-NEXT:    global_store_byte v[0:1], v11, off offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    global_store_short v[0:1], v11, off offset:28
+; CHECK-NEXT:    global_store_byte v[0:1], v10, off offset:30
 ; CHECK-NEXT:    global_store_dwordx4 v[0:1], v[3:6], off
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx3 v[0:1], v[7:9], off offset:16
@@ -3266,21 +3256,20 @@ define void @memmove_p3_p5_sz31_align_1_1(ptr addrspace(3) align 1 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_dword v8, v1, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v9, v1, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v10, v1, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v8, v1, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v2, v1, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v3, v1, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v4, v1, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v9, v1, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_ushort v10, v1, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v5, v1, s[0:3], 0 offen offset:12
 ; CHECK-NEXT:    buffer_load_dword v6, v1, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v7, v1, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    ds_write_b32 v0, v8 offset:24
-; CHECK-NEXT:    s_waitcnt vmcnt(7)
-; CHECK-NEXT:    ds_write_b16 v0, v9 offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(6)
-; CHECK-NEXT:    ds_write_b8 v0, v10 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(4)
+; CHECK-NEXT:    ds_write_b32 v0, v9 offset:24
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    ds_write_b16 v0, v10 offset:28
+; CHECK-NEXT:    ds_write_b8 v0, v8 offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(2)
 ; CHECK-NEXT:    ds_write2_b64 v0, v[2:3], v[4:5] offset1:1
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
@@ -3339,21 +3328,20 @@ define void @memmove_p3_p5_sz31_align_2_2(ptr addrspace(3) align 2 %dst, ptr add
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_dword v8, v1, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v9, v1, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v10, v1, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v8, v1, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v2, v1, s[0:3], 0 offen
 ; CHECK-NEXT:    buffer_load_dword v3, v1, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_load_dword v4, v1, s[0:3], 0 offen offset:8
+; CHECK-NEXT:    buffer_load_dword v9, v1, s[0:3], 0 offen offset:24
+; CHECK-NEXT:    buffer_load_ushort v10, v1, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v5, v1, s[0:3], 0 offen offset:12
 ; CHECK-NEXT:    buffer_load_dword v6, v1, s[0:3], 0 offen offset:16
 ; CHECK-NEXT:    buffer_load_dword v7, v1, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    s_waitcnt vmcnt(8)
-; CHECK-NEXT:    ds_write_b32 v0, v8 offset:24
-; CHECK-NEXT:    s_waitcnt vmcnt(7)
-; CHECK-NEXT:    ds_write_b16 v0, v9 offset:28
-; CHECK-NEXT:    s_waitcnt vmcnt(6)
-; CHECK-NEXT:    ds_write_b8 v0, v10 offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(4)
+; CHECK-NEXT:    ds_write_b32 v0, v9 offset:24
+; CHECK-NEXT:    s_waitcnt vmcnt(3)
+; CHECK-NEXT:    ds_write_b16 v0, v10 offset:28
+; CHECK-NEXT:    ds_write_b8 v0, v8 offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(2)
 ; CHECK-NEXT:    ds_write2_b64 v0, v[2:3], v[4:5] offset1:1
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
@@ -3485,22 +3473,21 @@ define void @memmove_p3_p5_sz31_align_16_16(ptr addrspace(3) align 16 %dst, ptr
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_clause 0x8
-; CHECK-NEXT:    buffer_load_dword v6, v1, s[0:3], 0 offen offset:16
-; CHECK-NEXT:    buffer_load_dword v7, v1, s[0:3], 0 offen offset:20
-; CHECK-NEXT:    buffer_load_dword v8, v1, s[0:3], 0 offen offset:24
-; CHECK-NEXT:    buffer_load_ushort v9, v1, s[0:3], 0 offen offset:28
-; CHECK-NEXT:    buffer_load_ubyte v10, v1, s[0:3], 0 offen offset:30
+; CHECK-NEXT:    buffer_load_ubyte v6, v1, s[0:3], 0 offen offset:30
 ; CHECK-NEXT:    buffer_load_dword v2, v1, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v7, v1, s[0:3], 0 offen offset:16
+; CHECK-NEXT:    buffer_load_dword v8, v1, s[0:3], 0 offen offset:20
+; CHECK-NEXT:    buffer_load_dword v9, v1, s[0:3], 0 offen offset:24
 ; CHECK-NEXT:    buffer_load_dword v3, v1, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    buffer_load_ushort v10, v1, s[0:3], 0 offen offset:28
 ; CHECK-NEXT:    buffer_load_dword v4, v1, s[0:3], 0 offen offset:8
 ; CHECK-NEXT:    buffer_load_dword v5, v1, s[0:3], 0 offen offset:12
-; CHECK-NEXT:    s_waitcnt vmcnt(6)
-; CHECK-NEXT:    ds_write2_b32 v0, v7, v8 offset0:5 offset1:6
-; CHECK-NEXT:    ds_write_b32 v0, v6 offset:16
-; CHECK-NEXT:    s_waitcnt vmcnt(5)
-; CHECK-NEXT:    ds_write_b16 v0, v9 offset:28
 ; CHECK-NEXT:    s_waitcnt vmcnt(4)
-; CHECK-NEXT:    ds_write_b8 v0, v10 offset:30
+; CHECK-NEXT:    ds_write2_b32 v0, v8, v9 offset0:5 offset1:6
+; CHECK-NEXT:    ds_write_b32 v0, v7 offset:16
+; CHECK-NEXT:    s_waitcnt vmcnt(2)
+; CHECK-NEXT:    ds_write_b16 v0, v10 offset:28
+; CHECK-NEXT:    ds_write_b8 v0, v6 offset:30
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    ds_write_b128 v0, v[2:5]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/min.ll b/llvm/test/CodeGen/AMDGPU/min.ll
index d2f4f54cefe78..86ccacaeaa3c4 100644
--- a/llvm/test/CodeGen/AMDGPU/min.ll
+++ b/llvm/test/CodeGen/AMDGPU/min.ll
@@ -649,14 +649,14 @@ define amdgpu_kernel void @s_test_imin_sle_v4i8(ptr addrspace(1) %out, [8 x i32]
 ;
 ; GFX9-LABEL: s_test_imin_sle_v4i8:
 ; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; GFX9-NEXT:    s_load_dword s3, s[8:9], 0x4c
 ; GFX9-NEXT:    s_load_dword s2, s[8:9], 0x28
-; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
 ; GFX9-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-NEXT:    s_lshr_b32 s5, s2, 16
 ; GFX9-NEXT:    s_lshr_b32 s8, s3, 16
 ; GFX9-NEXT:    s_ashr_i32 s9, s3, 24
+; GFX9-NEXT:    s_lshr_b32 s5, s2, 16
 ; GFX9-NEXT:    s_ashr_i32 s6, s2, 24
 ; GFX9-NEXT:    s_bfe_i32 s8, s8, 0x80000
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s9
diff --git a/llvm/test/CodeGen/AMDGPU/mul.ll b/llvm/test/CodeGen/AMDGPU/mul.ll
index 0f47a31f52dcb..b5e7589cbd134 100644
--- a/llvm/test/CodeGen/AMDGPU/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/mul.ll
@@ -2689,45 +2689,45 @@ define amdgpu_kernel void @s_mul_i128(ptr addrspace(1) %out, [8 x i32], i128 %a,
 ;
 ; GFX9-LABEL: s_mul_i128:
 ; GFX9:       ; %bb.0: ; %entry
-; GFX9-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x4c
-; GFX9-NEXT:    s_load_dwordx4 s[12:15], s[4:5], 0x7c
+; GFX9-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x7c
+; GFX9-NEXT:    s_load_dwordx4 s[12:15], s[4:5], 0x4c
 ; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; GFX9-NEXT:    s_mov_b32 s3, 0xf000
 ; GFX9-NEXT:    s_mov_b32 s2, -1
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-NEXT:    s_mul_i32 s4, s12, s11
-; GFX9-NEXT:    s_mul_hi_u32 s5, s12, s10
-; GFX9-NEXT:    s_mul_i32 s6, s14, s9
-; GFX9-NEXT:    s_mul_hi_u32 s7, s14, s8
+; GFX9-NEXT:    s_mul_i32 s4, s8, s15
+; GFX9-NEXT:    s_mul_hi_u32 s5, s8, s14
+; GFX9-NEXT:    s_mul_i32 s6, s10, s13
+; GFX9-NEXT:    s_mul_hi_u32 s7, s10, s12
 ; GFX9-NEXT:    s_add_i32 s4, s5, s4
-; GFX9-NEXT:    s_mul_i32 s5, s13, s10
+; GFX9-NEXT:    s_mul_i32 s5, s9, s14
 ; GFX9-NEXT:    s_add_i32 s6, s7, s6
-; GFX9-NEXT:    s_mul_i32 s7, s15, s8
+; GFX9-NEXT:    s_mul_i32 s7, s11, s12
 ; GFX9-NEXT:    s_add_i32 s4, s4, s5
-; GFX9-NEXT:    s_mul_i32 s5, s12, s10
+; GFX9-NEXT:    s_mul_i32 s5, s8, s14
 ; GFX9-NEXT:    s_add_i32 s6, s6, s7
-; GFX9-NEXT:    s_mul_i32 s7, s14, s8
+; GFX9-NEXT:    s_mul_i32 s7, s10, s12
 ; GFX9-NEXT:    s_add_u32 s7, s7, s5
 ; GFX9-NEXT:    s_addc_u32 s6, s6, s4
-; GFX9-NEXT:    s_mul_i32 s14, s9, s12
-; GFX9-NEXT:    s_mul_hi_u32 s15, s8, s12
-; GFX9-NEXT:    s_mul_hi_u32 s11, s9, s12
+; GFX9-NEXT:    s_mul_i32 s14, s13, s8
+; GFX9-NEXT:    s_mul_hi_u32 s15, s12, s8
+; GFX9-NEXT:    s_mul_hi_u32 s11, s13, s8
 ; GFX9-NEXT:    s_add_u32 s14, s14, s15
-; GFX9-NEXT:    s_mul_i32 s5, s8, s13
+; GFX9-NEXT:    s_mul_i32 s5, s12, s9
 ; GFX9-NEXT:    s_addc_u32 s11, s11, 0
-; GFX9-NEXT:    s_mul_hi_u32 s10, s8, s13
+; GFX9-NEXT:    s_mul_hi_u32 s10, s12, s9
 ; GFX9-NEXT:    s_add_u32 s5, s5, s14
 ; GFX9-NEXT:    s_addc_u32 s10, s10, 0
 ; GFX9-NEXT:    s_add_u32 s10, s11, s10
 ; GFX9-NEXT:    s_addc_u32 s11, 0, 0
-; GFX9-NEXT:    s_mul_hi_u32 s14, s9, s13
-; GFX9-NEXT:    s_mul_i32 s9, s9, s13
+; GFX9-NEXT:    s_mul_hi_u32 s14, s13, s9
+; GFX9-NEXT:    s_mul_i32 s9, s13, s9
 ; GFX9-NEXT:    s_add_u32 s9, s9, s10
 ; GFX9-NEXT:    s_addc_u32 s10, s14, s11
 ; GFX9-NEXT:    s_mov_b32 s4, 0
 ; GFX9-NEXT:    s_add_u32 s9, s9, s7
 ; GFX9-NEXT:    s_addc_u32 s10, s10, s6
-; GFX9-NEXT:    s_mul_i32 s6, s8, s12
+; GFX9-NEXT:    s_mul_i32 s6, s12, s8
 ; GFX9-NEXT:    s_mov_b32 s7, s4
 ; GFX9-NEXT:    s_or_b64 s[4:5], s[6:7], s[4:5]
 ; GFX9-NEXT:    v_mov_b32_e32 v0, s4
diff --git a/llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll b/llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll
index 151456e82ae51..57805063b92b1 100644
--- a/llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll
+++ b/llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll
@@ -40,8 +40,8 @@ define <2 x i64> @narrow_add_vec(<2 x i64> %a, <2 x i64> %b) #0 {
 ; CHECK-NEXT:    v_and_b32_e32 v3, 0x7ffffffe, v6
 ; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v2
-; CHECK-NEXT:    v_dual_mov_b32 v3, 0 :: v_dual_add_nc_u32 v2, v1, v3
-; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_add_nc_u32 v2, v1, v3
+; CHECK-NEXT:    v_mov_b32_e32 v3, 0
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
   %zext0 = and <2 x i64> %a, <i64 2147483647, i64 30>
   %zext1 = and <2 x i64> %b, <i64 2147483647, i64 2147483646>
diff --git a/llvm/test/CodeGen/AMDGPU/or.ll b/llvm/test/CodeGen/AMDGPU/or.ll
index cc9650b9a7309..1abd2e6b60f2f 100644
--- a/llvm/test/CodeGen/AMDGPU/or.ll
+++ b/llvm/test/CodeGen/AMDGPU/or.ll
@@ -355,15 +355,15 @@ define amdgpu_kernel void @scalar_or_literal_multi_use_i64(ptr addrspace(1) %out
 ;
 ; GFX8-LABEL: scalar_or_literal_multi_use_i64:
 ; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; GFX8-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x4c
+; GFX8-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; GFX8-NEXT:    s_load_dwordx2 s[4:5], s[4:5], 0x74
 ; GFX8-NEXT:    s_movk_i32 s8, 0x3039
 ; GFX8-NEXT:    s_mov_b32 s9, 0xf237b
-; GFX8-NEXT:    s_mov_b32 s3, 0xf000
 ; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX8-NEXT:    s_or_b64 s[6:7], s[6:7], s[8:9]
 ; GFX8-NEXT:    v_mov_b32_e32 v0, s6
+; GFX8-NEXT:    s_mov_b32 s3, 0xf000
 ; GFX8-NEXT:    s_mov_b32 s2, -1
 ; GFX8-NEXT:    v_mov_b32_e32 v1, s7
 ; GFX8-NEXT:    buffer_store_dwordx2 v[0:1], off, s[0:3], 0
diff --git a/llvm/test/CodeGen/AMDGPU/permute_i8.ll b/llvm/test/CodeGen/AMDGPU/permute_i8.ll
index 120aebf2bf7c8..a4ddfee115fa6 100644
--- a/llvm/test/CodeGen/AMDGPU/permute_i8.ll
+++ b/llvm/test/CodeGen/AMDGPU/permute_i8.ll
@@ -1944,71 +1944,71 @@ define hidden void @srem_store_div(ptr addrspace(1) %in0, ptr addrspace(1) %in1,
 ; GFX9-NEXT:    global_load_dword v9, v[0:1], off
 ; GFX9-NEXT:    s_mov_b32 s4, 0x2070306
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    v_cvt_f32_i32_sdwa v3, sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0
+; GFX9-NEXT:    v_cvt_f32_i32_sdwa v10, sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2
 ; GFX9-NEXT:    v_cvt_f32_i32_sdwa v14, sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_1
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_cvt_f32_i32_sdwa v13, sext(v9) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3
-; GFX9-NEXT:    v_cvt_f32_i32_sdwa v10, sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2
-; GFX9-NEXT:    v_cvt_f32_i32_sdwa v3, sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0
-; GFX9-NEXT:    v_rcp_iflag_f32_e32 v18, v14
+; GFX9-NEXT:    v_rcp_iflag_f32_e32 v17, v3
 ; GFX9-NEXT:    v_cvt_f32_i32_sdwa v16, sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3
-; GFX9-NEXT:    v_rcp_iflag_f32_e32 v19, v10
+; GFX9-NEXT:    v_xor_b32_sdwa v15, sext(v4), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:BYTE_2
 ; GFX9-NEXT:    v_perm_b32 v1, v4, v9, s4
-; GFX9-NEXT:    v_mul_f32_e32 v18, v13, v18
-; GFX9-NEXT:    v_trunc_f32_e32 v18, v18
-; GFX9-NEXT:    v_mad_f32 v13, -v18, v14, v13
-; GFX9-NEXT:    v_cmp_ge_f32_e64 vcc, |v13|, |v14|
-; GFX9-NEXT:    v_rcp_iflag_f32_e32 v13, v3
-; GFX9-NEXT:    v_mul_f32_e32 v14, v16, v19
-; GFX9-NEXT:    v_trunc_f32_e32 v14, v14
-; GFX9-NEXT:    v_mad_f32 v19, -v14, v10, v16
-; GFX9-NEXT:    v_mul_f32_e32 v13, v10, v13
-; GFX9-NEXT:    v_trunc_f32_e32 v13, v13
-; GFX9-NEXT:    v_cmp_ge_f32_e64 s[4:5], |v19|, |v10|
-; GFX9-NEXT:    v_mad_f32 v10, -v13, v3, v10
-; GFX9-NEXT:    v_cvt_f32_i32_sdwa v19, sext(v9) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2
-; GFX9-NEXT:    v_cmp_ge_f32_e64 s[6:7], |v10|, |v3|
-; GFX9-NEXT:    v_rcp_iflag_f32_e32 v3, v16
-; GFX9-NEXT:    v_xor_b32_sdwa v12, sext(v9), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:BYTE_1
+; GFX9-NEXT:    v_mul_f32_e32 v17, v10, v17
+; GFX9-NEXT:    v_trunc_f32_e32 v17, v17
+; GFX9-NEXT:    v_mad_f32 v19, -v17, v3, v10
+; GFX9-NEXT:    v_cmp_ge_f32_e64 vcc, |v19|, |v3|
+; GFX9-NEXT:    v_rcp_iflag_f32_e32 v3, v14
+; GFX9-NEXT:    v_rcp_iflag_f32_e32 v19, v10
 ; GFX9-NEXT:    v_xor_b32_sdwa v2, sext(v4), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2 src1_sel:BYTE_0
-; GFX9-NEXT:    v_xor_b32_sdwa v15, sext(v4), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:BYTE_2
-; GFX9-NEXT:    v_mul_f32_e32 v3, v19, v3
+; GFX9-NEXT:    v_xor_b32_sdwa v12, sext(v9), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:BYTE_1
+; GFX9-NEXT:    v_mul_f32_e32 v3, v13, v3
 ; GFX9-NEXT:    v_trunc_f32_e32 v3, v3
-; GFX9-NEXT:    v_ashrrev_i32_e32 v12, 30, v12
-; GFX9-NEXT:    v_xor_b32_sdwa v10, sext(v9), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2 src1_sel:BYTE_3
-; GFX9-NEXT:    v_cvt_i32_f32_e32 v13, v13
-; GFX9-NEXT:    v_cvt_i32_f32_e32 v18, v18
-; GFX9-NEXT:    v_cvt_i32_f32_e32 v14, v14
-; GFX9-NEXT:    v_mad_f32 v19, -v3, v16, v19
-; GFX9-NEXT:    v_cvt_i32_f32_e32 v3, v3
-; GFX9-NEXT:    v_ashrrev_i32_e32 v15, 30, v15
-; GFX9-NEXT:    v_or_b32_e32 v12, 1, v12
+; GFX9-NEXT:    v_mad_f32 v13, -v3, v14, v13
+; GFX9-NEXT:    v_cmp_ge_f32_e64 s[4:5], |v13|, |v14|
+; GFX9-NEXT:    v_ashrrev_i32_e32 v14, 30, v15
+; GFX9-NEXT:    v_mul_f32_e32 v15, v16, v19
+; GFX9-NEXT:    v_trunc_f32_e32 v15, v15
+; GFX9-NEXT:    v_mad_f32 v19, -v15, v10, v16
+; GFX9-NEXT:    v_cvt_f32_i32_sdwa v13, sext(v9) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2
+; GFX9-NEXT:    v_cmp_ge_f32_e64 s[6:7], |v19|, |v10|
+; GFX9-NEXT:    v_rcp_iflag_f32_e32 v10, v16
 ; GFX9-NEXT:    v_ashrrev_i32_e32 v2, 30, v2
-; GFX9-NEXT:    v_ashrrev_i32_e32 v10, 30, v10
-; GFX9-NEXT:    v_or_b32_e32 v15, 1, v15
+; GFX9-NEXT:    v_xor_b32_sdwa v19, sext(v9), sext(v4) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_2 src1_sel:BYTE_3
+; GFX9-NEXT:    v_cvt_i32_f32_e32 v17, v17
+; GFX9-NEXT:    v_mul_f32_e32 v10, v13, v10
+; GFX9-NEXT:    v_trunc_f32_e32 v10, v10
+; GFX9-NEXT:    v_cvt_i32_f32_e32 v3, v3
+; GFX9-NEXT:    v_cvt_i32_f32_e32 v15, v15
+; GFX9-NEXT:    v_mad_f32 v13, -v10, v16, v13
+; GFX9-NEXT:    v_cvt_i32_f32_e32 v10, v10
 ; GFX9-NEXT:    v_or_b32_e32 v2, 1, v2
-; GFX9-NEXT:    v_or_b32_e32 v10, 1, v10
-; GFX9-NEXT:    v_cndmask_b32_e32 v12, 0, v12, vcc
-; GFX9-NEXT:    v_cmp_ge_f32_e64 vcc, |v19|, |v16|
-; GFX9-NEXT:    v_cndmask_b32_e64 v2, 0, v2, s[6:7]
-; GFX9-NEXT:    v_cndmask_b32_e64 v15, 0, v15, s[4:5]
-; GFX9-NEXT:    v_cndmask_b32_e32 v10, 0, v10, vcc
+; GFX9-NEXT:    v_ashrrev_i32_e32 v12, 30, v12
+; GFX9-NEXT:    v_ashrrev_i32_e32 v19, 30, v19
+; GFX9-NEXT:    v_or_b32_e32 v12, 1, v12
+; GFX9-NEXT:    v_or_b32_e32 v14, 1, v14
+; GFX9-NEXT:    v_or_b32_e32 v19, 1, v19
+; GFX9-NEXT:    v_cndmask_b32_e32 v2, 0, v2, vcc
+; GFX9-NEXT:    v_cmp_ge_f32_e64 vcc, |v13|, |v16|
+; GFX9-NEXT:    v_cndmask_b32_e64 v12, 0, v12, s[4:5]
+; GFX9-NEXT:    v_cndmask_b32_e64 v14, 0, v14, s[6:7]
+; GFX9-NEXT:    v_cndmask_b32_e32 v13, 0, v19, vcc
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v0, 16, v4
 ; GFX9-NEXT:    v_lshrrev_b32_e32 v11, 8, v4
-; GFX9-NEXT:    v_lshrrev_b32_e32 v17, 24, v4
-; GFX9-NEXT:    v_add_u32_e32 v2, v13, v2
-; GFX9-NEXT:    v_add_u32_e32 v12, v18, v12
-; GFX9-NEXT:    v_add_u32_e32 v13, v14, v15
-; GFX9-NEXT:    v_add_u32_e32 v3, v3, v10
+; GFX9-NEXT:    v_lshrrev_b32_e32 v18, 24, v4
+; GFX9-NEXT:    v_add_u32_e32 v2, v17, v2
+; GFX9-NEXT:    v_add_u32_e32 v3, v3, v12
+; GFX9-NEXT:    v_add_u32_e32 v12, v15, v14
+; GFX9-NEXT:    v_add_u32_e32 v10, v10, v13
 ; GFX9-NEXT:    v_mul_lo_u32 v2, v2, v4
-; GFX9-NEXT:    v_mul_lo_u32 v4, v12, v11
-; GFX9-NEXT:    v_mul_lo_u32 v10, v13, v0
-; GFX9-NEXT:    v_mul_lo_u32 v3, v3, v17
+; GFX9-NEXT:    v_mul_lo_u32 v3, v3, v11
+; GFX9-NEXT:    v_mul_lo_u32 v4, v12, v0
+; GFX9-NEXT:    v_mul_lo_u32 v10, v10, v18
 ; GFX9-NEXT:    v_sub_u32_e32 v0, v0, v2
-; GFX9-NEXT:    v_sub_u32_sdwa v2, v9, v4 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
-; GFX9-NEXT:    v_sub_u32_e32 v4, v17, v10
-; GFX9-NEXT:    v_sub_u32_sdwa v3, v9, v3 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX9-NEXT:    v_sub_u32_sdwa v2, v9, v3 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
+; GFX9-NEXT:    v_sub_u32_e32 v3, v18, v4
+; GFX9-NEXT:    v_sub_u32_sdwa v4, v9, v10 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX9-NEXT:    v_or_b32_sdwa v2, v4, v3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX9-NEXT:    v_or_b32_sdwa v2, v3, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GFX9-NEXT:    v_or_b32_sdwa v0, v0, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX9-NEXT:    global_store_dword v[5:6], v0, off
 ; GFX9-NEXT:    global_store_dword v[7:8], v1, off
@@ -3656,14 +3656,14 @@ define hidden void @extract_v6i16(ptr addrspace(1) %in0, ptr addrspace(1) %in1,
 ; GFX9-LABEL: extract_v6i16:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    global_load_ushort v2, v[0:1], off offset:6
-; GFX9-NEXT:    global_load_ushort v3, v[0:1], off
-; GFX9-NEXT:    global_load_ushort v8, v[0:1], off offset:4
+; GFX9-NEXT:    global_load_ushort v2, v[0:1], off offset:4
+; GFX9-NEXT:    global_load_ushort v3, v[0:1], off offset:6
+; GFX9-NEXT:    global_load_ushort v8, v[0:1], off
 ; GFX9-NEXT:    global_load_ushort v9, v[0:1], off offset:2
-; GFX9-NEXT:    s_waitcnt vmcnt(1)
-; GFX9-NEXT:    v_lshl_or_b32 v0, v2, 16, v8
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
+; GFX9-NEXT:    v_lshl_or_b32 v0, v3, 16, v2
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-NEXT:    v_lshl_or_b32 v1, v9, 16, v3
+; GFX9-NEXT:    v_lshl_or_b32 v1, v9, 16, v8
 ; GFX9-NEXT:    global_store_dword v[4:5], v1, off
 ; GFX9-NEXT:    global_store_dword v[6:7], v0, off
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/pr51516.mir b/llvm/test/CodeGen/AMDGPU/pr51516.mir
index 81925de8910f8..69983faf2b154 100644
--- a/llvm/test/CodeGen/AMDGPU/pr51516.mir
+++ b/llvm/test/CodeGen/AMDGPU/pr51516.mir
@@ -1,3 +1,4 @@
+# NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 # RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx900 -amdgpu-disable-unclustered-high-rp-reschedule -verify-misched -start-before=machine-scheduler -stop-after=virtregrewriter,2 -o - %s | FileCheck -check-prefix=GCN %s
 # RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx900 -amdgpu-disable-unclustered-high-rp-reschedule -amdgpu-use-amdgpu-trackers=1 -verify-misched -start-before=machine-scheduler -stop-after=virtregrewriter,2 -o - %s | FileCheck -check-prefix=GCN-GCNTRACKER %s
 
@@ -6,7 +7,7 @@
 
 # GCN-LABEL: name: global_sextload_v32i32_to_v32i64
 # GCN: renamable $vgpr34_vgpr35_vgpr36_vgpr37 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
-# GCN: GLOBAL_STORE_DWORDX4_SADDR killed renamable $vgpr47, killed renamable $vgpr26_vgpr27_vgpr28_vgpr29, killed renamable $sgpr0_sgpr1, 16, 0, implicit $exec, implicit killed renamable $vgpr46
+# GCN: GLOBAL_STORE_DWORDX4_SADDR killed renamable $vgpr5, killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1, 16, 0, implicit $exec, implicit killed renamable $vgpr4
 
 # GCN-GCNTRACKER-LABEL: name: global_sextload_v32i32_to_v32i64
 # GCN-GCNTRACKER-NOT: SI_SPILL
@@ -116,3 +117,6 @@ body:             |
     S_ENDPGM 0
 
 ...
+## NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+# GCN: {{.*}}
+# GCN-GCNTRACKER: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
index e452af7d60c0c..c4842c1f4f523 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
@@ -748,18 +748,18 @@ define hidden amdgpu_kernel void @clmem_read(ptr addrspace(1)  %buffer) {
 ; GFX90A-NEXT:    global_load_dwordx2 v[12:13], v[12:13], off
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v15, vcc, -1, v7, vcc
 ; GFX90A-NEXT:    global_load_dwordx2 v[18:19], v[14:15], off offset:-2048
+; GFX90A-NEXT:    global_load_dwordx2 v[20:21], v[14:15], off
 ; GFX90A-NEXT:    v_add_co_u32_e32 v16, vcc, s0, v6
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v17, vcc, -1, v7, vcc
 ; GFX90A-NEXT:    global_load_dwordx2 v[16:17], v[16:17], off offset:-2048
-; GFX90A-NEXT:    v_add_co_u32_e32 v20, vcc, s1, v6
-; GFX90A-NEXT:    global_load_dwordx2 v[14:15], v[14:15], off
-; GFX90A-NEXT:    v_addc_co_u32_e32 v21, vcc, -1, v7, vcc
-; GFX90A-NEXT:    global_load_dwordx2 v[24:25], v[20:21], off offset:-4096
-; GFX90A-NEXT:    global_load_dwordx2 v[26:27], v[20:21], off offset:-2048
-; GFX90A-NEXT:    global_load_dwordx2 v[28:29], v[20:21], off
+; GFX90A-NEXT:    v_add_co_u32_e32 v14, vcc, s1, v6
+; GFX90A-NEXT:    v_addc_co_u32_e32 v15, vcc, -1, v7, vcc
+; GFX90A-NEXT:    global_load_dwordx2 v[24:25], v[14:15], off offset:-4096
+; GFX90A-NEXT:    global_load_dwordx2 v[26:27], v[14:15], off offset:-2048
+; GFX90A-NEXT:    global_load_dwordx2 v[28:29], v[14:15], off
 ; GFX90A-NEXT:    v_add_co_u32_e32 v22, vcc, s2, v6
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v23, vcc, -1, v7, vcc
-; GFX90A-NEXT:    global_load_dwordx2 v[20:21], v[22:23], off offset:-2048
+; GFX90A-NEXT:    global_load_dwordx2 v[14:15], v[22:23], off offset:-2048
 ; GFX90A-NEXT:    global_load_dwordx2 v[30:31], v[6:7], off
 ; GFX90A-NEXT:    v_add_co_u32_e32 v6, vcc, 0x10000, v6
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v7, vcc, 0, v7, vcc
@@ -771,9 +771,10 @@ define hidden amdgpu_kernel void @clmem_read(ptr addrspace(1)  %buffer) {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(7)
 ; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v18, v1
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v19, v4, vcc
+; GFX90A-NEXT:    s_waitcnt vmcnt(6)
+; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v20, v1
+; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v21, v4, vcc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(5)
-; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v14, v1
-; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v15, v4, vcc
 ; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v16, v1
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v17, v4, vcc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(4)
@@ -786,8 +787,8 @@ define hidden amdgpu_kernel void @clmem_read(ptr addrspace(1)  %buffer) {
 ; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v28, v1
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v29, v4, vcc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(1)
-; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v20, v1
-; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v21, v4, vcc
+; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v14, v1
+; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v15, v4, vcc
 ; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v8, v1
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v4, vcc, v9, v4, vcc
 ; GFX90A-NEXT:    v_add_co_u32_e32 v1, vcc, v10, v1
@@ -847,14 +848,13 @@ define hidden amdgpu_kernel void @clmem_read(ptr addrspace(1)  %buffer) {
 ; GFX11-NEXT:    v_add_co_u32 v7, vcc_lo, v4, 0xffffc000
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v8, null, -1, v5, vcc_lo
 ; GFX11-NEXT:    v_add_co_u32 v9, vcc_lo, 0xffffc000, v4
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v10, null, -1, v5, vcc_lo
 ; GFX11-NEXT:    global_load_b64 v[13:14], v[7:8], off offset:-4096
 ; GFX11-NEXT:    v_add_co_u32 v11, vcc_lo, 0xffffd000, v4
-; GFX11-NEXT:    global_load_b64 v[9:10], v[9:10], off offset:-2048
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v12, null, -1, v5, vcc_lo
 ; GFX11-NEXT:    v_add_co_u32 v15, vcc_lo, v4, 0xffffe000
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    global_load_b64 v[9:10], v[9:10], off offset:-2048
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v16, null, -1, v5, vcc_lo
 ; GFX11-NEXT:    global_load_b64 v[11:12], v[11:12], off offset:-2048
 ; GFX11-NEXT:    v_add_co_u32 v17, vcc_lo, 0xffffe000, v4
@@ -1193,9 +1193,9 @@ define amdgpu_kernel void @Address32(ptr addrspace(1) %buffer) {
 ; GFX10-NEXT:    s_swappc_b64 s[30:31], s[6:7]
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 7, v0
 ; GFX10-NEXT:    v_mov_b32_e32 v2, 2
-; GFX10-NEXT:    v_and_b32_e32 v8, 0xffff8000, v1
+; GFX10-NEXT:    v_and_b32_e32 v10, 0xffff8000, v1
 ; GFX10-NEXT:    v_lshlrev_b32_sdwa v0, v2, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX10-NEXT:    v_add_co_u32 v1, s0, s34, v8
+; GFX10-NEXT:    v_add_co_u32 v1, s0, s34, v10
 ; GFX10-NEXT:    v_add_co_ci_u32_e64 v2, s0, s35, 0, s0
 ; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, v1, v0
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v2, vcc_lo
@@ -1203,38 +1203,38 @@ define amdgpu_kernel void @Address32(ptr addrspace(1) %buffer) {
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v3, vcc_lo, 0, v1, vcc_lo
 ; GFX10-NEXT:    v_add_co_u32 v4, vcc_lo, v0, 0x1000
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v5, vcc_lo, 0, v1, vcc_lo
+; GFX10-NEXT:    s_clause 0x3
+; GFX10-NEXT:    global_load_dword v11, v[0:1], off
+; GFX10-NEXT:    global_load_dword v12, v[0:1], off offset:1024
+; GFX10-NEXT:    global_load_dword v13, v[4:5], off offset:-2048
+; GFX10-NEXT:    global_load_dword v14, v[2:3], off offset:1024
 ; GFX10-NEXT:    v_add_co_u32 v6, vcc_lo, 0x1000, v0
-; GFX10-NEXT:    s_clause 0x4
-; GFX10-NEXT:    global_load_dword v9, v[0:1], off
-; GFX10-NEXT:    global_load_dword v10, v[0:1], off offset:1024
-; GFX10-NEXT:    global_load_dword v11, v[2:3], off offset:1024
-; GFX10-NEXT:    global_load_dword v12, v[4:5], off offset:-2048
-; GFX10-NEXT:    global_load_dword v13, v[4:5], off
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v7, vcc_lo, 0, v1, vcc_lo
 ; GFX10-NEXT:    v_add_co_u32 v2, vcc_lo, 0x1800, v0
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v3, vcc_lo, 0, v1, vcc_lo
-; GFX10-NEXT:    v_add_co_u32 v4, vcc_lo, v0, 0x2000
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v5, vcc_lo, 0, v1, vcc_lo
-; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    global_load_dword v14, v[6:7], off offset:1024
-; GFX10-NEXT:    global_load_dword v15, v[2:3], off offset:1024
+; GFX10-NEXT:    v_add_co_u32 v8, vcc_lo, v0, 0x2000
+; GFX10-NEXT:    v_add_co_ci_u32_e32 v9, vcc_lo, 0, v1, vcc_lo
+; GFX10-NEXT:    s_clause 0x2
+; GFX10-NEXT:    global_load_dword v15, v[4:5], off
+; GFX10-NEXT:    global_load_dword v16, v[6:7], off offset:1024
+; GFX10-NEXT:    global_load_dword v17, v[2:3], off offset:1024
 ; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, 0x2000, v0
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v1, vcc_lo
 ; GFX10-NEXT:    s_clause 0x2
-; GFX10-NEXT:    global_load_dword v2, v[4:5], off offset:-2048
-; GFX10-NEXT:    global_load_dword v3, v[4:5], off
-; GFX10-NEXT:    global_load_dword v6, v[0:1], off offset:1024
+; GFX10-NEXT:    global_load_dword v2, v[8:9], off offset:-2048
+; GFX10-NEXT:    global_load_dword v3, v[8:9], off
+; GFX10-NEXT:    global_load_dword v4, v[0:1], off offset:1024
 ; GFX10-NEXT:    s_waitcnt vmcnt(8)
-; GFX10-NEXT:    v_add_nc_u32_e32 v0, v10, v9
+; GFX10-NEXT:    v_add_nc_u32_e32 v0, v12, v11
 ; GFX10-NEXT:    s_waitcnt vmcnt(6)
-; GFX10-NEXT:    v_add3_u32 v0, v12, v0, v11
-; GFX10-NEXT:    s_waitcnt vmcnt(4)
 ; GFX10-NEXT:    v_add3_u32 v0, v13, v0, v14
+; GFX10-NEXT:    s_waitcnt vmcnt(4)
+; GFX10-NEXT:    v_add3_u32 v0, v15, v0, v16
 ; GFX10-NEXT:    s_waitcnt vmcnt(2)
-; GFX10-NEXT:    v_add3_u32 v0, v2, v0, v15
+; GFX10-NEXT:    v_add3_u32 v0, v2, v0, v17
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_add3_u32 v0, v3, v0, v6
-; GFX10-NEXT:    global_store_dword v8, v0, s[34:35]
+; GFX10-NEXT:    v_add3_u32 v0, v3, v0, v4
+; GFX10-NEXT:    global_store_dword v10, v0, s[34:35]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: Address32:
@@ -1375,19 +1375,19 @@ define amdgpu_kernel void @Offset64(ptr addrspace(1)  %buffer) {
 ; GFX8-NEXT:    v_add_u32_e32 v3, vcc, v1, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v4, vcc, 0, v2, vcc
 ; GFX8-NEXT:    s_movk_i32 s0, 0xf000
-; GFX8-NEXT:    v_add_u32_e32 v7, vcc, s0, v3
-; GFX8-NEXT:    v_addc_u32_e32 v8, vcc, 0, v4, vcc
+; GFX8-NEXT:    v_add_u32_e32 v5, vcc, s0, v3
+; GFX8-NEXT:    v_addc_u32_e32 v6, vcc, 0, v4, vcc
 ; GFX8-NEXT:    s_movk_i32 s0, 0xf800
-; GFX8-NEXT:    flat_load_dwordx2 v[5:6], v[3:4]
-; GFX8-NEXT:    flat_load_dwordx2 v[7:8], v[7:8]
+; GFX8-NEXT:    flat_load_dwordx2 v[7:8], v[3:4]
+; GFX8-NEXT:    flat_load_dwordx2 v[5:6], v[5:6]
 ; GFX8-NEXT:    v_add_u32_e32 v9, vcc, s0, v3
 ; GFX8-NEXT:    v_addc_u32_e32 v10, vcc, 0, v4, vcc
 ; GFX8-NEXT:    flat_load_dwordx2 v[9:10], v[9:10]
 ; GFX8-NEXT:    v_add_u32_e32 v4, vcc, 1, v4
 ; GFX8-NEXT:    flat_load_dwordx2 v[3:4], v[3:4]
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v7, v5
-; GFX8-NEXT:    v_addc_u32_e32 v5, vcc, v8, v6, vcc
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v5, v7
+; GFX8-NEXT:    v_addc_u32_e32 v5, vcc, v6, v8, vcc
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v9, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v5, vcc, v10, v5, vcc
@@ -1429,14 +1429,14 @@ define amdgpu_kernel void @Offset64(ptr addrspace(1)  %buffer) {
 ; GFX9-NEXT:    s_movk_i32 s0, 0xf000
 ; GFX9-NEXT:    v_add_co_u32_e32 v2, vcc, s0, v0
 ; GFX9-NEXT:    v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
-; GFX9-NEXT:    global_load_dwordx2 v[4:5], v[2:3], off
-; GFX9-NEXT:    global_load_dwordx2 v[6:7], v[0:1], off
+; GFX9-NEXT:    global_load_dwordx2 v[4:5], v[0:1], off
+; GFX9-NEXT:    global_load_dwordx2 v[6:7], v[2:3], off
 ; GFX9-NEXT:    global_load_dwordx2 v[8:9], v[2:3], off offset:2048
 ; GFX9-NEXT:    v_add_u32_e32 v1, 1, v1
 ; GFX9-NEXT:    global_load_dwordx2 v[0:1], v[0:1], off
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_add_co_u32_e32 v2, vcc, v4, v6
-; GFX9-NEXT:    v_addc_co_u32_e32 v3, vcc, v5, v7, vcc
+; GFX9-NEXT:    v_add_co_u32_e32 v2, vcc, v6, v4
+; GFX9-NEXT:    v_addc_co_u32_e32 v3, vcc, v7, v5, vcc
 ; GFX9-NEXT:    s_waitcnt vmcnt(1)
 ; GFX9-NEXT:    v_add_co_u32_e32 v2, vcc, v8, v2
 ; GFX9-NEXT:    v_addc_co_u32_e32 v3, vcc, v9, v3, vcc
@@ -1521,15 +1521,15 @@ define amdgpu_kernel void @Offset64(ptr addrspace(1)  %buffer) {
 ; GFX11-NEXT:    v_add_co_u32 v2, vcc_lo, 0xfffff000, v0
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v3, null, 0, v1, vcc_lo
 ; GFX11-NEXT:    s_clause 0x2
-; GFX11-NEXT:    global_load_b64 v[4:5], v[2:3], off
-; GFX11-NEXT:    global_load_b64 v[6:7], v[0:1], off
+; GFX11-NEXT:    global_load_b64 v[4:5], v[0:1], off
+; GFX11-NEXT:    global_load_b64 v[6:7], v[2:3], off
 ; GFX11-NEXT:    global_load_b64 v[2:3], v[2:3], off offset:2048
 ; GFX11-NEXT:    v_add_nc_u32_e32 v1, 1, v1
 ; GFX11-NEXT:    global_load_b64 v[0:1], v[0:1], off
 ; GFX11-NEXT:    s_waitcnt vmcnt(2)
-; GFX11-NEXT:    v_add_co_u32 v4, vcc_lo, v4, v6
+; GFX11-NEXT:    v_add_co_u32 v4, vcc_lo, v6, v4
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_add_co_ci_u32_e64 v5, null, v5, v7, vcc_lo
+; GFX11-NEXT:    v_add_co_ci_u32_e64 v5, null, v7, v5, vcc_lo
 ; GFX11-NEXT:    s_waitcnt vmcnt(1)
 ; GFX11-NEXT:    v_add_co_u32 v2, vcc_lo, v2, v4
 ; GFX11-NEXT:    v_add_co_ci_u32_e64 v3, null, v3, v5, vcc_lo
@@ -1686,26 +1686,26 @@ define amdgpu_kernel void @p32Offset64(ptr addrspace(1)  %buffer) {
 ; GFX10-NEXT:    s_swappc_b64 s[30:31], s[6:7]
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 7, v0
 ; GFX10-NEXT:    v_mov_b32_e32 v2, 2
-; GFX10-NEXT:    v_and_b32_e32 v4, 0xffff8000, v1
+; GFX10-NEXT:    v_and_b32_e32 v6, 0xffff8000, v1
 ; GFX10-NEXT:    v_lshlrev_b32_sdwa v0, v2, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX10-NEXT:    v_add_co_u32 v1, s0, s34, v4
+; GFX10-NEXT:    v_add_co_u32 v1, s0, s34, v6
 ; GFX10-NEXT:    v_add_co_ci_u32_e64 v2, s0, s35, 0, s0
 ; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, v1, v0
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v2, vcc_lo
 ; GFX10-NEXT:    v_add_co_u32 v2, vcc_lo, v0, 0x80000000
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v3, vcc_lo, 0, v1, vcc_lo
-; GFX10-NEXT:    global_load_dword v5, v[0:1], off
-; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, 0x7ffff800, v0
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v1, vcc_lo
-; GFX10-NEXT:    s_clause 0x2
-; GFX10-NEXT:    global_load_dword v6, v[2:3], off offset:-2048
-; GFX10-NEXT:    global_load_dword v7, v[2:3], off
-; GFX10-NEXT:    global_load_dword v8, v[0:1], off offset:1024
+; GFX10-NEXT:    v_add_co_u32 v4, vcc_lo, 0x7ffff800, v0
+; GFX10-NEXT:    v_add_co_ci_u32_e32 v5, vcc_lo, 0, v1, vcc_lo
+; GFX10-NEXT:    s_clause 0x3
+; GFX10-NEXT:    global_load_dword v7, v[0:1], off
+; GFX10-NEXT:    global_load_dword v8, v[2:3], off offset:-2048
+; GFX10-NEXT:    global_load_dword v9, v[2:3], off
+; GFX10-NEXT:    global_load_dword v10, v[4:5], off offset:1024
 ; GFX10-NEXT:    s_waitcnt vmcnt(2)
-; GFX10-NEXT:    v_add_nc_u32_e32 v0, v6, v5
+; GFX10-NEXT:    v_add_nc_u32_e32 v0, v8, v7
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_add3_u32 v0, v8, v0, v7
-; GFX10-NEXT:    global_store_dword v4, v0, s[34:35]
+; GFX10-NEXT:    v_add3_u32 v0, v10, v0, v9
+; GFX10-NEXT:    global_store_dword v6, v0, s[34:35]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: p32Offset64:
@@ -2160,25 +2160,25 @@ define amdgpu_kernel void @ReverseOrder(ptr addrspace(1) %buffer) {
 ; GFX9-NEXT:    v_add_co_u32_e32 v0, vcc, v2, v0
 ; GFX9-NEXT:    v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
 ; GFX9-NEXT:    s_movk_i32 s0, 0x3000
-; GFX9-NEXT:    v_add_co_u32_e32 v4, vcc, s0, v0
-; GFX9-NEXT:    global_load_dwordx2 v[2:3], v[0:1], off
-; GFX9-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
-; GFX9-NEXT:    global_load_dwordx2 v[6:7], v[4:5], off offset:2048
-; GFX9-NEXT:    global_load_dwordx2 v[8:9], v[4:5], off
+; GFX9-NEXT:    v_add_co_u32_e32 v2, vcc, s0, v0
+; GFX9-NEXT:    v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
+; GFX9-NEXT:    global_load_dwordx2 v[4:5], v[0:1], off
+; GFX9-NEXT:    global_load_dwordx2 v[6:7], v[2:3], off offset:2048
+; GFX9-NEXT:    global_load_dwordx2 v[8:9], v[2:3], off
 ; GFX9-NEXT:    s_movk_i32 s0, 0x2000
-; GFX9-NEXT:    v_add_co_u32_e32 v4, vcc, s0, v0
-; GFX9-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
-; GFX9-NEXT:    global_load_dwordx2 v[10:11], v[4:5], off offset:2048
+; GFX9-NEXT:    v_add_co_u32_e32 v2, vcc, s0, v0
+; GFX9-NEXT:    v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
+; GFX9-NEXT:    global_load_dwordx2 v[10:11], v[2:3], off offset:2048
 ; GFX9-NEXT:    s_movk_i32 s0, 0x1000
 ; GFX9-NEXT:    v_add_co_u32_e32 v12, vcc, s0, v0
 ; GFX9-NEXT:    v_addc_co_u32_e32 v13, vcc, 0, v1, vcc
 ; GFX9-NEXT:    global_load_dwordx2 v[14:15], v[12:13], off
-; GFX9-NEXT:    global_load_dwordx2 v[16:17], v[4:5], off
+; GFX9-NEXT:    global_load_dwordx2 v[16:17], v[2:3], off
 ; GFX9-NEXT:    global_load_dwordx2 v[18:19], v[12:13], off offset:2048
 ; GFX9-NEXT:    global_load_dwordx2 v[20:21], v[0:1], off offset:2048
 ; GFX9-NEXT:    s_waitcnt vmcnt(6)
-; GFX9-NEXT:    v_add_co_u32_e32 v0, vcc, v6, v2
-; GFX9-NEXT:    v_addc_co_u32_e32 v1, vcc, v7, v3, vcc
+; GFX9-NEXT:    v_add_co_u32_e32 v0, vcc, v6, v4
+; GFX9-NEXT:    v_addc_co_u32_e32 v1, vcc, v7, v5, vcc
 ; GFX9-NEXT:    s_waitcnt vmcnt(5)
 ; GFX9-NEXT:    v_add_co_u32_e32 v0, vcc, v8, v0
 ; GFX9-NEXT:    v_addc_co_u32_e32 v1, vcc, v9, v1, vcc
diff --git a/llvm/test/CodeGen/AMDGPU/repeated-divisor.ll b/llvm/test/CodeGen/AMDGPU/repeated-divisor.ll
index 04eea20993608..d34d2050b157e 100644
--- a/llvm/test/CodeGen/AMDGPU/repeated-divisor.ll
+++ b/llvm/test/CodeGen/AMDGPU/repeated-divisor.ll
@@ -232,8 +232,8 @@ define <2 x float> @v_repeat_divisor_f32_x2_arcp_daz(float %x, float %y, float %
 ; GFX11-NEXT:    v_div_fmas_f32 v3, v3, v4, v6
 ; GFX11-NEXT:    v_div_fixup_f32 v2, v3, v2, 1.0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; GFX11-NEXT:    v_mul_f32_e32 v0, v0, v2
+; GFX11-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %div0 = fdiv arcp float %x, %D
   %div1 = fdiv arcp float %y, %D
@@ -434,8 +434,8 @@ define <3 x float> @v_repeat_divisor_f32_x3_arcp(float %x, float %y, float %z, f
 ; GFX11-NEXT:    v_div_fmas_f32 v4, v4, v5, v6
 ; GFX11-NEXT:    v_div_fixup_f32 v3, v4, v3, 1.0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_mul_f32_e32 v1, v1, v3
 ; GFX11-NEXT:    v_mul_f32_e32 v0, v0, v3
+; GFX11-NEXT:    v_mul_f32_e32 v1, v1, v3
 ; GFX11-NEXT:    v_mul_f32_e32 v2, v2, v3
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %div0 = fdiv arcp float %x, %D
diff --git a/llvm/test/CodeGen/AMDGPU/sdiv.ll b/llvm/test/CodeGen/AMDGPU/sdiv.ll
index d06d9f97db71c..676359fcec462 100644
--- a/llvm/test/CodeGen/AMDGPU/sdiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/sdiv.ll
@@ -804,82 +804,83 @@ define amdgpu_kernel void @sdiv_v4i32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; GCN-NEXT:    s_mov_b32 s4, s0
 ; GCN-NEXT:    s_mov_b32 s5, s1
 ; GCN-NEXT:    s_waitcnt vmcnt(1)
-; GCN-NEXT:    v_sub_i32_e32 v13, vcc, 0, v1
+; GCN-NEXT:    v_sub_i32_e32 v9, vcc, 0, v0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
-; GCN-NEXT:    v_sub_i32_e32 v12, vcc, 0, v5
-; GCN-NEXT:    v_xor_b32_e32 v11, v1, v5
-; GCN-NEXT:    v_max_i32_e32 v5, v5, v12
-; GCN-NEXT:    v_cvt_f32_u32_e32 v12, v5
 ; GCN-NEXT:    v_sub_i32_e32 v10, vcc, 0, v4
 ; GCN-NEXT:    v_xor_b32_e32 v8, v0, v4
-; GCN-NEXT:    v_rcp_iflag_f32_e32 v12, v12
 ; GCN-NEXT:    v_max_i32_e32 v4, v4, v10
-; GCN-NEXT:    v_sub_i32_e32 v16, vcc, 0, v5
-; GCN-NEXT:    v_mul_f32_e32 v10, 0x4f7ffffe, v12
+; GCN-NEXT:    v_cvt_f32_u32_e32 v10, v4
+; GCN-NEXT:    v_sub_i32_e32 v13, vcc, 0, v5
+; GCN-NEXT:    v_xor_b32_e32 v11, v1, v5
+; GCN-NEXT:    v_rcp_iflag_f32_e32 v10, v10
+; GCN-NEXT:    v_max_i32_e32 v5, v5, v13
+; GCN-NEXT:    v_cvt_f32_u32_e32 v13, v5
+; GCN-NEXT:    v_sub_i32_e32 v16, vcc, 0, v4
+; GCN-NEXT:    v_mul_f32_e32 v10, 0x4f7ffffe, v10
 ; GCN-NEXT:    v_cvt_u32_f32_e32 v10, v10
-; GCN-NEXT:    v_cvt_f32_u32_e32 v12, v4
-; GCN-NEXT:    v_max_i32_e32 v1, v1, v13
-; GCN-NEXT:    v_sub_i32_e32 v15, vcc, 0, v6
+; GCN-NEXT:    v_rcp_iflag_f32_e32 v13, v13
+; GCN-NEXT:    v_sub_i32_e32 v12, vcc, 0, v1
 ; GCN-NEXT:    v_mul_lo_u32 v16, v16, v10
-; GCN-NEXT:    v_rcp_iflag_f32_e32 v12, v12
+; GCN-NEXT:    v_mul_f32_e32 v13, 0x4f7ffffe, v13
+; GCN-NEXT:    v_cvt_u32_f32_e32 v13, v13
+; GCN-NEXT:    v_max_i32_e32 v0, v0, v9
+; GCN-NEXT:    v_mul_hi_u32 v16, v10, v16
+; GCN-NEXT:    v_max_i32_e32 v1, v1, v12
+; GCN-NEXT:    v_sub_i32_e32 v15, vcc, 0, v6
+; GCN-NEXT:    v_add_i32_e32 v10, vcc, v10, v16
+; GCN-NEXT:    v_sub_i32_e32 v16, vcc, 0, v5
+; GCN-NEXT:    v_mul_lo_u32 v16, v16, v13
+; GCN-NEXT:    v_mul_hi_u32 v10, v0, v10
 ; GCN-NEXT:    v_xor_b32_e32 v14, v2, v6
 ; GCN-NEXT:    v_max_i32_e32 v6, v6, v15
-; GCN-NEXT:    v_mul_hi_u32 v16, v10, v16
-; GCN-NEXT:    v_mul_f32_e32 v12, 0x4f7ffffe, v12
-; GCN-NEXT:    v_cvt_u32_f32_e32 v12, v12
+; GCN-NEXT:    v_mul_hi_u32 v12, v13, v16
 ; GCN-NEXT:    v_cvt_f32_u32_e32 v15, v6
-; GCN-NEXT:    v_add_i32_e32 v10, vcc, v10, v16
-; GCN-NEXT:    v_sub_i32_e32 v16, vcc, 0, v4
-; GCN-NEXT:    v_mul_lo_u32 v16, v16, v12
-; GCN-NEXT:    v_mul_hi_u32 v10, v1, v10
-; GCN-NEXT:    v_sub_i32_e32 v9, vcc, 0, v0
-; GCN-NEXT:    v_mul_hi_u32 v13, v12, v16
-; GCN-NEXT:    v_max_i32_e32 v0, v0, v9
-; GCN-NEXT:    v_rcp_iflag_f32_e32 v9, v15
 ; GCN-NEXT:    v_ashrrev_i32_e32 v8, 31, v8
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, v12, v13
-; GCN-NEXT:    v_mul_lo_u32 v13, v10, v5
-; GCN-NEXT:    v_mul_hi_u32 v12, v0, v12
-; GCN-NEXT:    v_mul_f32_e32 v9, 0x4f7ffffe, v9
-; GCN-NEXT:    v_cvt_u32_f32_e32 v9, v9
-; GCN-NEXT:    v_sub_i32_e32 v1, vcc, v1, v13
+; GCN-NEXT:    v_ashrrev_i32_e32 v11, 31, v11
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, v13, v12
+; GCN-NEXT:    v_mul_lo_u32 v13, v10, v4
+; GCN-NEXT:    v_mul_hi_u32 v12, v1, v12
+; GCN-NEXT:    v_rcp_iflag_f32_e32 v9, v15
+; GCN-NEXT:    v_ashrrev_i32_e32 v14, 31, v14
+; GCN-NEXT:    v_sub_i32_e32 v0, vcc, v0, v13
 ; GCN-NEXT:    v_add_i32_e32 v13, vcc, 1, v10
-; GCN-NEXT:    v_cmp_ge_u32_e64 s[0:1], v1, v5
+; GCN-NEXT:    v_cmp_ge_u32_e64 s[0:1], v0, v4
 ; GCN-NEXT:    v_cndmask_b32_e64 v10, v10, v13, s[0:1]
-; GCN-NEXT:    v_sub_i32_e32 v13, vcc, v1, v5
-; GCN-NEXT:    v_cndmask_b32_e64 v1, v1, v13, s[0:1]
-; GCN-NEXT:    v_cmp_ge_u32_e64 s[0:1], v1, v5
-; GCN-NEXT:    v_mul_lo_u32 v1, v12, v4
-; GCN-NEXT:    v_sub_i32_e32 v5, vcc, 0, v6
-; GCN-NEXT:    v_mul_lo_u32 v5, v5, v9
-; GCN-NEXT:    v_sub_i32_e32 v0, vcc, v0, v1
+; GCN-NEXT:    v_sub_i32_e32 v13, vcc, v0, v4
+; GCN-NEXT:    v_cndmask_b32_e64 v0, v0, v13, s[0:1]
+; GCN-NEXT:    v_cmp_ge_u32_e64 s[0:1], v0, v4
+; GCN-NEXT:    v_mul_lo_u32 v0, v12, v5
+; GCN-NEXT:    v_mul_f32_e32 v9, 0x4f7ffffe, v9
+; GCN-NEXT:    v_cvt_u32_f32_e32 v9, v9
+; GCN-NEXT:    v_sub_i32_e32 v4, vcc, 0, v6
+; GCN-NEXT:    v_sub_i32_e32 v0, vcc, v1, v0
 ; GCN-NEXT:    v_add_i32_e32 v1, vcc, 1, v12
-; GCN-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v4
+; GCN-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v5
 ; GCN-NEXT:    v_cndmask_b32_e64 v1, v12, v1, s[2:3]
-; GCN-NEXT:    v_sub_i32_e32 v12, vcc, v0, v4
+; GCN-NEXT:    v_sub_i32_e32 v12, vcc, v0, v5
+; GCN-NEXT:    v_mul_lo_u32 v4, v4, v9
 ; GCN-NEXT:    v_cndmask_b32_e64 v0, v0, v12, s[2:3]
-; GCN-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v4
+; GCN-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v5
 ; GCN-NEXT:    v_sub_i32_e32 v0, vcc, 0, v7
-; GCN-NEXT:    v_mul_hi_u32 v4, v9, v5
 ; GCN-NEXT:    v_max_i32_e32 v5, v7, v0
 ; GCN-NEXT:    v_cvt_f32_u32_e32 v0, v5
-; GCN-NEXT:    v_add_i32_e32 v12, vcc, 1, v1
-; GCN-NEXT:    v_add_i32_e32 v4, vcc, v9, v4
+; GCN-NEXT:    v_mul_hi_u32 v4, v9, v4
+; GCN-NEXT:    v_add_i32_e32 v13, vcc, 1, v10
 ; GCN-NEXT:    v_rcp_iflag_f32_e32 v0, v0
+; GCN-NEXT:    v_add_i32_e32 v4, vcc, v9, v4
 ; GCN-NEXT:    v_sub_i32_e32 v9, vcc, 0, v2
 ; GCN-NEXT:    v_max_i32_e32 v2, v2, v9
 ; GCN-NEXT:    v_mul_hi_u32 v4, v2, v4
 ; GCN-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
 ; GCN-NEXT:    v_cvt_u32_f32_e32 v9, v0
-; GCN-NEXT:    v_cndmask_b32_e64 v0, v1, v12, s[2:3]
+; GCN-NEXT:    v_cndmask_b32_e64 v0, v10, v13, s[0:1]
 ; GCN-NEXT:    v_xor_b32_e32 v0, v0, v8
 ; GCN-NEXT:    v_sub_i32_e32 v0, vcc, v0, v8
 ; GCN-NEXT:    v_mul_lo_u32 v8, v4, v6
-; GCN-NEXT:    v_add_i32_e32 v13, vcc, 1, v10
-; GCN-NEXT:    v_cndmask_b32_e64 v1, v10, v13, s[0:1]
+; GCN-NEXT:    v_add_i32_e32 v12, vcc, 1, v1
 ; GCN-NEXT:    v_sub_i32_e32 v10, vcc, 0, v5
 ; GCN-NEXT:    v_sub_i32_e32 v2, vcc, v2, v8
-; GCN-NEXT:    v_ashrrev_i32_e32 v11, 31, v11
+; GCN-NEXT:    v_cndmask_b32_e64 v1, v1, v12, s[2:3]
 ; GCN-NEXT:    v_mul_lo_u32 v10, v10, v9
 ; GCN-NEXT:    v_add_i32_e32 v8, vcc, 1, v4
 ; GCN-NEXT:    v_cmp_ge_u32_e64 s[0:1], v2, v6
@@ -896,7 +897,6 @@ define amdgpu_kernel void @sdiv_v4i32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; GCN-NEXT:    v_max_i32_e32 v6, v3, v6
 ; GCN-NEXT:    v_add_i32_e32 v4, vcc, v9, v4
 ; GCN-NEXT:    v_mul_hi_u32 v4, v6, v4
-; GCN-NEXT:    v_ashrrev_i32_e32 v14, 31, v14
 ; GCN-NEXT:    v_xor_b32_e32 v2, v2, v14
 ; GCN-NEXT:    v_sub_i32_e32 v2, vcc, v2, v14
 ; GCN-NEXT:    v_mul_lo_u32 v8, v4, v5
@@ -931,82 +931,83 @@ define amdgpu_kernel void @sdiv_v4i32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; TONGA-NEXT:    s_mov_b32 s4, s0
 ; TONGA-NEXT:    s_mov_b32 s5, s1
 ; TONGA-NEXT:    s_waitcnt vmcnt(1)
-; TONGA-NEXT:    v_sub_u32_e32 v13, vcc, 0, v1
+; TONGA-NEXT:    v_sub_u32_e32 v9, vcc, 0, v0
 ; TONGA-NEXT:    s_waitcnt vmcnt(0)
-; TONGA-NEXT:    v_sub_u32_e32 v12, vcc, 0, v5
-; TONGA-NEXT:    v_xor_b32_e32 v11, v1, v5
-; TONGA-NEXT:    v_max_i32_e32 v5, v5, v12
-; TONGA-NEXT:    v_cvt_f32_u32_e32 v12, v5
 ; TONGA-NEXT:    v_sub_u32_e32 v10, vcc, 0, v4
 ; TONGA-NEXT:    v_xor_b32_e32 v8, v0, v4
-; TONGA-NEXT:    v_rcp_iflag_f32_e32 v12, v12
 ; TONGA-NEXT:    v_max_i32_e32 v4, v4, v10
-; TONGA-NEXT:    v_sub_u32_e32 v16, vcc, 0, v5
-; TONGA-NEXT:    v_mul_f32_e32 v10, 0x4f7ffffe, v12
+; TONGA-NEXT:    v_cvt_f32_u32_e32 v10, v4
+; TONGA-NEXT:    v_sub_u32_e32 v13, vcc, 0, v5
+; TONGA-NEXT:    v_xor_b32_e32 v11, v1, v5
+; TONGA-NEXT:    v_rcp_iflag_f32_e32 v10, v10
+; TONGA-NEXT:    v_max_i32_e32 v5, v5, v13
+; TONGA-NEXT:    v_cvt_f32_u32_e32 v13, v5
+; TONGA-NEXT:    v_sub_u32_e32 v16, vcc, 0, v4
+; TONGA-NEXT:    v_mul_f32_e32 v10, 0x4f7ffffe, v10
 ; TONGA-NEXT:    v_cvt_u32_f32_e32 v10, v10
-; TONGA-NEXT:    v_cvt_f32_u32_e32 v12, v4
-; TONGA-NEXT:    v_max_i32_e32 v1, v1, v13
-; TONGA-NEXT:    v_sub_u32_e32 v15, vcc, 0, v6
+; TONGA-NEXT:    v_rcp_iflag_f32_e32 v13, v13
+; TONGA-NEXT:    v_sub_u32_e32 v12, vcc, 0, v1
 ; TONGA-NEXT:    v_mul_lo_u32 v16, v16, v10
-; TONGA-NEXT:    v_rcp_iflag_f32_e32 v12, v12
+; TONGA-NEXT:    v_mul_f32_e32 v13, 0x4f7ffffe, v13
+; TONGA-NEXT:    v_cvt_u32_f32_e32 v13, v13
+; TONGA-NEXT:    v_max_i32_e32 v0, v0, v9
+; TONGA-NEXT:    v_mul_hi_u32 v16, v10, v16
+; TONGA-NEXT:    v_max_i32_e32 v1, v1, v12
+; TONGA-NEXT:    v_sub_u32_e32 v15, vcc, 0, v6
+; TONGA-NEXT:    v_add_u32_e32 v10, vcc, v10, v16
+; TONGA-NEXT:    v_sub_u32_e32 v16, vcc, 0, v5
+; TONGA-NEXT:    v_mul_lo_u32 v16, v16, v13
+; TONGA-NEXT:    v_mul_hi_u32 v10, v0, v10
 ; TONGA-NEXT:    v_xor_b32_e32 v14, v2, v6
 ; TONGA-NEXT:    v_max_i32_e32 v6, v6, v15
-; TONGA-NEXT:    v_mul_hi_u32 v16, v10, v16
-; TONGA-NEXT:    v_mul_f32_e32 v12, 0x4f7ffffe, v12
-; TONGA-NEXT:    v_cvt_u32_f32_e32 v12, v12
+; TONGA-NEXT:    v_mul_hi_u32 v12, v13, v16
 ; TONGA-NEXT:    v_cvt_f32_u32_e32 v15, v6
-; TONGA-NEXT:    v_add_u32_e32 v10, vcc, v10, v16
-; TONGA-NEXT:    v_sub_u32_e32 v16, vcc, 0, v4
-; TONGA-NEXT:    v_mul_lo_u32 v16, v16, v12
-; TONGA-NEXT:    v_mul_hi_u32 v10, v1, v10
-; TONGA-NEXT:    v_sub_u32_e32 v9, vcc, 0, v0
-; TONGA-NEXT:    v_mul_hi_u32 v13, v12, v16
-; TONGA-NEXT:    v_max_i32_e32 v0, v0, v9
-; TONGA-NEXT:    v_rcp_iflag_f32_e32 v9, v15
 ; TONGA-NEXT:    v_ashrrev_i32_e32 v8, 31, v8
-; TONGA-NEXT:    v_add_u32_e32 v12, vcc, v12, v13
-; TONGA-NEXT:    v_mul_lo_u32 v13, v10, v5
-; TONGA-NEXT:    v_mul_hi_u32 v12, v0, v12
-; TONGA-NEXT:    v_mul_f32_e32 v9, 0x4f7ffffe, v9
-; TONGA-NEXT:    v_cvt_u32_f32_e32 v9, v9
-; TONGA-NEXT:    v_sub_u32_e32 v1, vcc, v1, v13
+; TONGA-NEXT:    v_ashrrev_i32_e32 v11, 31, v11
+; TONGA-NEXT:    v_add_u32_e32 v12, vcc, v13, v12
+; TONGA-NEXT:    v_mul_lo_u32 v13, v10, v4
+; TONGA-NEXT:    v_mul_hi_u32 v12, v1, v12
+; TONGA-NEXT:    v_rcp_iflag_f32_e32 v9, v15
+; TONGA-NEXT:    v_ashrrev_i32_e32 v14, 31, v14
+; TONGA-NEXT:    v_sub_u32_e32 v0, vcc, v0, v13
 ; TONGA-NEXT:    v_add_u32_e32 v13, vcc, 1, v10
-; TONGA-NEXT:    v_cmp_ge_u32_e64 s[0:1], v1, v5
+; TONGA-NEXT:    v_cmp_ge_u32_e64 s[0:1], v0, v4
 ; TONGA-NEXT:    v_cndmask_b32_e64 v10, v10, v13, s[0:1]
-; TONGA-NEXT:    v_sub_u32_e32 v13, vcc, v1, v5
-; TONGA-NEXT:    v_cndmask_b32_e64 v1, v1, v13, s[0:1]
-; TONGA-NEXT:    v_cmp_ge_u32_e64 s[0:1], v1, v5
-; TONGA-NEXT:    v_mul_lo_u32 v1, v12, v4
-; TONGA-NEXT:    v_sub_u32_e32 v5, vcc, 0, v6
-; TONGA-NEXT:    v_mul_lo_u32 v5, v5, v9
-; TONGA-NEXT:    v_sub_u32_e32 v0, vcc, v0, v1
+; TONGA-NEXT:    v_sub_u32_e32 v13, vcc, v0, v4
+; TONGA-NEXT:    v_cndmask_b32_e64 v0, v0, v13, s[0:1]
+; TONGA-NEXT:    v_cmp_ge_u32_e64 s[0:1], v0, v4
+; TONGA-NEXT:    v_mul_lo_u32 v0, v12, v5
+; TONGA-NEXT:    v_mul_f32_e32 v9, 0x4f7ffffe, v9
+; TONGA-NEXT:    v_cvt_u32_f32_e32 v9, v9
+; TONGA-NEXT:    v_sub_u32_e32 v4, vcc, 0, v6
+; TONGA-NEXT:    v_sub_u32_e32 v0, vcc, v1, v0
 ; TONGA-NEXT:    v_add_u32_e32 v1, vcc, 1, v12
-; TONGA-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v4
+; TONGA-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v5
 ; TONGA-NEXT:    v_cndmask_b32_e64 v1, v12, v1, s[2:3]
-; TONGA-NEXT:    v_sub_u32_e32 v12, vcc, v0, v4
+; TONGA-NEXT:    v_sub_u32_e32 v12, vcc, v0, v5
+; TONGA-NEXT:    v_mul_lo_u32 v4, v4, v9
 ; TONGA-NEXT:    v_cndmask_b32_e64 v0, v0, v12, s[2:3]
-; TONGA-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v4
+; TONGA-NEXT:    v_cmp_ge_u32_e64 s[2:3], v0, v5
 ; TONGA-NEXT:    v_sub_u32_e32 v0, vcc, 0, v7
-; TONGA-NEXT:    v_mul_hi_u32 v4, v9, v5
 ; TONGA-NEXT:    v_max_i32_e32 v5, v7, v0
 ; TONGA-NEXT:    v_cvt_f32_u32_e32 v0, v5
-; TONGA-NEXT:    v_add_u32_e32 v12, vcc, 1, v1
-; TONGA-NEXT:    v_add_u32_e32 v4, vcc, v9, v4
+; TONGA-NEXT:    v_mul_hi_u32 v4, v9, v4
+; TONGA-NEXT:    v_add_u32_e32 v13, vcc, 1, v10
 ; TONGA-NEXT:    v_rcp_iflag_f32_e32 v0, v0
+; TONGA-NEXT:    v_add_u32_e32 v4, vcc, v9, v4
 ; TONGA-NEXT:    v_sub_u32_e32 v9, vcc, 0, v2
 ; TONGA-NEXT:    v_max_i32_e32 v2, v2, v9
 ; TONGA-NEXT:    v_mul_hi_u32 v4, v2, v4
 ; TONGA-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
 ; TONGA-NEXT:    v_cvt_u32_f32_e32 v9, v0
-; TONGA-NEXT:    v_cndmask_b32_e64 v0, v1, v12, s[2:3]
+; TONGA-NEXT:    v_cndmask_b32_e64 v0, v10, v13, s[0:1]
 ; TONGA-NEXT:    v_xor_b32_e32 v0, v0, v8
 ; TONGA-NEXT:    v_sub_u32_e32 v0, vcc, v0, v8
 ; TONGA-NEXT:    v_mul_lo_u32 v8, v4, v6
-; TONGA-NEXT:    v_add_u32_e32 v13, vcc, 1, v10
-; TONGA-NEXT:    v_cndmask_b32_e64 v1, v10, v13, s[0:1]
+; TONGA-NEXT:    v_add_u32_e32 v12, vcc, 1, v1
 ; TONGA-NEXT:    v_sub_u32_e32 v10, vcc, 0, v5
 ; TONGA-NEXT:    v_sub_u32_e32 v2, vcc, v2, v8
-; TONGA-NEXT:    v_ashrrev_i32_e32 v11, 31, v11
+; TONGA-NEXT:    v_cndmask_b32_e64 v1, v1, v12, s[2:3]
 ; TONGA-NEXT:    v_mul_lo_u32 v10, v10, v9
 ; TONGA-NEXT:    v_add_u32_e32 v8, vcc, 1, v4
 ; TONGA-NEXT:    v_cmp_ge_u32_e64 s[0:1], v2, v6
@@ -1023,7 +1024,6 @@ define amdgpu_kernel void @sdiv_v4i32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; TONGA-NEXT:    v_max_i32_e32 v6, v3, v6
 ; TONGA-NEXT:    v_add_u32_e32 v4, vcc, v9, v4
 ; TONGA-NEXT:    v_mul_hi_u32 v4, v6, v4
-; TONGA-NEXT:    v_ashrrev_i32_e32 v14, 31, v14
 ; TONGA-NEXT:    v_xor_b32_e32 v2, v2, v14
 ; TONGA-NEXT:    v_sub_u32_e32 v2, vcc, v2, v14
 ; TONGA-NEXT:    v_mul_lo_u32 v8, v4, v5
diff --git a/llvm/test/CodeGen/AMDGPU/select.f16.ll b/llvm/test/CodeGen/AMDGPU/select.f16.ll
index 7339b545686f5..ca450e1882454 100644
--- a/llvm/test/CodeGen/AMDGPU/select.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/select.f16.ll
@@ -2183,37 +2183,41 @@ define <16 x half> @v_vselect_v16f16(<16 x half> %a, <16 x half> %b, <16 x i32>
 ; SI-LABEL: v_vselect_v16f16:
 ; SI:       ; %bb.0:
 ; SI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; SI-NEXT:    buffer_load_dword v36, off, s[0:3], s32 offset:4
-; SI-NEXT:    v_cvt_f16_f32_e32 v16, v16
-; SI-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; SI-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; SI-NEXT:    v_cvt_f16_f32_e32 v17, v17
-; SI-NEXT:    v_cvt_f32_f16_e32 v37, v16
-; SI-NEXT:    buffer_load_dword v38, off, s[0:3], s32 offset:8
-; SI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:12
+; SI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:12
 ; SI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:16
 ; SI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:20
 ; SI-NEXT:    buffer_load_dword v34, off, s[0:3], s32 offset:24
-; SI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:28
-; SI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:32
+; SI-NEXT:    buffer_load_dword v35, off, s[0:3], s32 offset:28
+; SI-NEXT:    v_cvt_f16_f32_e32 v4, v4
+; SI-NEXT:    v_cvt_f16_f32_e32 v20, v20
+; SI-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; SI-NEXT:    v_cvt_f16_f32_e32 v16, v16
+; SI-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; SI-NEXT:    v_cvt_f32_f16_e32 v20, v20
 ; SI-NEXT:    v_cvt_f32_f16_e32 v0, v0
-; SI-NEXT:    v_cvt_f32_f16_e32 v1, v1
-; SI-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; SI-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; SI-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; SI-NEXT:    v_cvt_f16_f32_e32 v2, v2
+; SI-NEXT:    v_cvt_f16_f32_e32 v18, v18
 ; SI-NEXT:    v_cvt_f16_f32_e32 v3, v3
-; SI-NEXT:    v_cvt_f16_f32_e32 v4, v4
-; SI-NEXT:    v_cvt_f16_f32_e32 v5, v5
+; SI-NEXT:    v_cvt_f32_f16_e32 v1, v1
+; SI-NEXT:    v_cvt_f16_f32_e32 v19, v19
 ; SI-NEXT:    v_cvt_f32_f16_e32 v2, v2
+; SI-NEXT:    v_cvt_f32_f16_e32 v18, v18
 ; SI-NEXT:    v_cvt_f32_f16_e32 v3, v3
-; SI-NEXT:    v_cvt_f32_f16_e32 v4, v4
+; SI-NEXT:    v_cvt_f32_f16_e32 v19, v19
+; SI-NEXT:    v_cvt_f16_f32_e32 v5, v5
 ; SI-NEXT:    v_cvt_f16_f32_e32 v6, v6
-; SI-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; SI-NEXT:    v_cvt_f16_f32_e32 v7, v7
 ; SI-NEXT:    v_cvt_f16_f32_e32 v8, v8
+; SI-NEXT:    v_cvt_f32_f16_e32 v5, v5
 ; SI-NEXT:    v_cvt_f32_f16_e32 v6, v6
-; SI-NEXT:    v_cvt_f16_f32_e32 v9, v9
 ; SI-NEXT:    v_cvt_f32_f16_e32 v7, v7
+; SI-NEXT:    v_cvt_f16_f32_e32 v24, v24
 ; SI-NEXT:    v_cvt_f32_f16_e32 v8, v8
+; SI-NEXT:    v_cvt_f16_f32_e32 v9, v9
+; SI-NEXT:    v_cvt_f16_f32_e32 v25, v25
+; SI-NEXT:    v_cvt_f32_f16_e32 v24, v24
 ; SI-NEXT:    v_cvt_f16_f32_e32 v10, v10
 ; SI-NEXT:    v_cvt_f32_f16_e32 v9, v9
 ; SI-NEXT:    v_cvt_f16_f32_e32 v11, v11
@@ -2223,162 +2227,154 @@ define <16 x half> @v_vselect_v16f16(<16 x half> %a, <16 x half> %b, <16 x i32>
 ; SI-NEXT:    v_cvt_f32_f16_e32 v11, v11
 ; SI-NEXT:    v_cvt_f32_f16_e32 v12, v12
 ; SI-NEXT:    v_cvt_f16_f32_e32 v14, v14
-; SI-NEXT:    v_cvt_f16_f32_e32 v15, v15
 ; SI-NEXT:    v_cvt_f32_f16_e32 v13, v13
+; SI-NEXT:    v_cvt_f16_f32_e32 v15, v15
 ; SI-NEXT:    v_cvt_f32_f16_e32 v14, v14
 ; SI-NEXT:    v_cvt_f32_f16_e32 v15, v15
-; SI-NEXT:    s_waitcnt vmcnt(7)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v36
-; SI-NEXT:    v_cndmask_b32_e32 v0, v37, v0, vcc
-; SI-NEXT:    s_waitcnt vmcnt(6)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v38
-; SI-NEXT:    v_cndmask_b32_e32 v1, v17, v1, vcc
-; SI-NEXT:    v_cvt_f16_f32_e32 v17, v18
-; SI-NEXT:    s_waitcnt vmcnt(5)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v35
-; SI-NEXT:    v_cvt_f16_f32_e32 v18, v20
-; SI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:48
-; SI-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; SI-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; SI-NEXT:    v_cndmask_b32_e32 v2, v17, v2, vcc
-; SI-NEXT:    v_cvt_f16_f32_e32 v17, v19
-; SI-NEXT:    s_waitcnt vmcnt(5)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v32
-; SI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:36
-; SI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:44
-; SI-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; SI-NEXT:    v_cndmask_b32_e32 v3, v17, v3, vcc
-; SI-NEXT:    s_waitcnt vmcnt(6)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v33
-; SI-NEXT:    buffer_load_dword v33, off, s[0:3], s32 offset:40
-; SI-NEXT:    v_cvt_f16_f32_e32 v17, v21
-; SI-NEXT:    v_cndmask_b32_e32 v4, v18, v4, vcc
-; SI-NEXT:    v_cvt_f16_f32_e32 v18, v22
-; SI-NEXT:    s_waitcnt vmcnt(6)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v34
-; SI-NEXT:    v_cvt_f32_f16_e32 v17, v17
-; SI-NEXT:    v_cvt_f16_f32_e32 v22, v23
-; SI-NEXT:    v_cvt_f32_f16_e32 v21, v18
-; SI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:56
-; SI-NEXT:    v_cndmask_b32_e32 v5, v17, v5, vcc
-; SI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:52
-; SI-NEXT:    s_waitcnt vmcnt(7)
+; SI-NEXT:    s_waitcnt vmcnt(3)
+; SI-NEXT:    v_cmp_eq_u32_e64 s[4:5], 0, v32
+; SI-NEXT:    s_waitcnt vmcnt(2)
+; SI-NEXT:    v_cmp_eq_u32_e64 s[6:7], 0, v33
+; SI-NEXT:    v_cndmask_b32_e64 v4, v20, v4, s[6:7]
+; SI-NEXT:    buffer_load_dword v20, off, s[0:3], s32 offset:44
 ; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v31
-; SI-NEXT:    v_cndmask_b32_e32 v6, v21, v6, vcc
-; SI-NEXT:    s_waitcnt vmcnt(6)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v16
-; SI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:60
-; SI-NEXT:    buffer_load_dword v21, off, s[0:3], s32
-; SI-NEXT:    v_cvt_f32_f16_e32 v22, v22
-; SI-NEXT:    v_cvt_f16_f32_e32 v23, v24
-; SI-NEXT:    v_cvt_f16_f32_e32 v24, v25
-; SI-NEXT:    v_cndmask_b32_e32 v7, v22, v7, vcc
-; SI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:64
-; SI-NEXT:    v_cvt_f32_f16_e32 v23, v23
-; SI-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; SI-NEXT:    s_waitcnt vmcnt(7)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v32
-; SI-NEXT:    v_cndmask_b32_e32 v8, v23, v8, vcc
-; SI-NEXT:    v_cvt_f16_f32_e32 v23, v26
-; SI-NEXT:    v_cvt_f32_f16_e32 v23, v23
-; SI-NEXT:    s_waitcnt vmcnt(5)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v33
-; SI-NEXT:    v_cndmask_b32_e32 v9, v24, v9, vcc
-; SI-NEXT:    v_cvt_f16_f32_e32 v24, v27
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v19
-; SI-NEXT:    v_cvt_f16_f32_e32 v19, v28
-; SI-NEXT:    v_cndmask_b32_e32 v10, v23, v10, vcc
-; SI-NEXT:    v_cvt_f32_f16_e32 v24, v24
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v20
+; SI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:4
+; SI-NEXT:    buffer_load_dword v32, off, s[0:3], s32 offset:8
+; SI-NEXT:    v_cndmask_b32_e32 v2, v18, v2, vcc
+; SI-NEXT:    v_cndmask_b32_e64 v3, v19, v3, s[4:5]
+; SI-NEXT:    v_cvt_f16_f32_e32 v19, v21
+; SI-NEXT:    v_cvt_f16_f32_e32 v21, v22
+; SI-NEXT:    s_waitcnt vmcnt(4)
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v34
+; SI-NEXT:    buffer_load_dword v18, off, s[0:3], s32 offset:40
 ; SI-NEXT:    v_cvt_f32_f16_e32 v19, v19
-; SI-NEXT:    v_cvt_f16_f32_e32 v20, v29
-; SI-NEXT:    v_cndmask_b32_e32 v11, v24, v11, vcc
-; SI-NEXT:    s_waitcnt vmcnt(3)
+; SI-NEXT:    v_cndmask_b32_e32 v5, v19, v5, vcc
+; SI-NEXT:    v_cvt_f32_f16_e32 v19, v21
+; SI-NEXT:    v_cvt_f16_f32_e32 v21, v23
+; SI-NEXT:    s_waitcnt vmcnt(4)
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v35
+; SI-NEXT:    v_cndmask_b32_e32 v6, v19, v6, vcc
+; SI-NEXT:    v_cvt_f32_f16_e32 v19, v21
+; SI-NEXT:    s_waitcnt vmcnt(2)
+; SI-NEXT:    v_cmp_eq_u32_e64 s[8:9], 0, v31
+; SI-NEXT:    v_cndmask_b32_e64 v0, v16, v0, s[8:9]
+; SI-NEXT:    v_cvt_f16_f32_e32 v16, v17
+; SI-NEXT:    s_waitcnt vmcnt(1)
+; SI-NEXT:    v_cmp_eq_u32_e64 s[8:9], 0, v32
+; SI-NEXT:    buffer_load_dword v17, off, s[0:3], s32 offset:36
+; SI-NEXT:    buffer_load_dword v21, off, s[0:3], s32 offset:48
+; SI-NEXT:    buffer_load_dword v22, off, s[0:3], s32 offset:52
+; SI-NEXT:    buffer_load_dword v23, off, s[0:3], s32
+; SI-NEXT:    buffer_load_dword v31, off, s[0:3], s32 offset:60
+; SI-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; SI-NEXT:    v_cndmask_b32_e64 v1, v16, v1, s[8:9]
+; SI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:32
+; SI-NEXT:    s_waitcnt vmcnt(0)
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v16
+; SI-NEXT:    buffer_load_dword v16, off, s[0:3], s32 offset:56
+; SI-NEXT:    v_cndmask_b32_e32 v7, v19, v7, vcc
+; SI-NEXT:    buffer_load_dword v19, off, s[0:3], s32 offset:64
 ; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v17
-; SI-NEXT:    v_cvt_f16_f32_e32 v17, v30
-; SI-NEXT:    v_cndmask_b32_e32 v12, v19, v12, vcc
+; SI-NEXT:    v_cndmask_b32_e32 v8, v24, v8, vcc
+; SI-NEXT:    v_cvt_f32_f16_e32 v17, v25
+; SI-NEXT:    v_cvt_f16_f32_e32 v24, v26
 ; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v18
-; SI-NEXT:    s_waitcnt vmcnt(1)
-; SI-NEXT:    v_cvt_f16_f32_e32 v18, v21
+; SI-NEXT:    v_cvt_f16_f32_e32 v18, v29
+; SI-NEXT:    v_cndmask_b32_e32 v9, v17, v9, vcc
+; SI-NEXT:    v_cvt_f32_f16_e32 v17, v24
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v20
+; SI-NEXT:    v_cvt_f16_f32_e32 v20, v28
+; SI-NEXT:    v_cndmask_b32_e32 v10, v17, v10, vcc
+; SI-NEXT:    v_cvt_f32_f16_e32 v17, v18
+; SI-NEXT:    v_cvt_f16_f32_e32 v18, v27
 ; SI-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; SI-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v21
 ; SI-NEXT:    v_cvt_f32_f16_e32 v18, v18
-; SI-NEXT:    v_cndmask_b32_e32 v13, v20, v13, vcc
+; SI-NEXT:    v_cndmask_b32_e32 v11, v18, v11, vcc
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v22
+; SI-NEXT:    v_cndmask_b32_e32 v12, v20, v12, vcc
+; SI-NEXT:    s_waitcnt vmcnt(1)
 ; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v16
-; SI-NEXT:    v_cndmask_b32_e32 v14, v17, v14, vcc
+; SI-NEXT:    v_cvt_f16_f32_e32 v16, v30
+; SI-NEXT:    v_cndmask_b32_e32 v13, v17, v13, vcc
+; SI-NEXT:    v_cvt_f16_f32_e32 v17, v23
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v31
+; SI-NEXT:    v_cvt_f32_f16_e32 v16, v16
+; SI-NEXT:    v_cvt_f32_f16_e32 v17, v17
+; SI-NEXT:    v_cndmask_b32_e32 v14, v16, v14, vcc
 ; SI-NEXT:    s_waitcnt vmcnt(0)
-; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v22
-; SI-NEXT:    v_cndmask_b32_e32 v15, v18, v15, vcc
+; SI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v19
+; SI-NEXT:    v_cndmask_b32_e32 v15, v17, v15, vcc
 ; SI-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; VI-LABEL: v_vselect_v16f16:
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; VI-NEXT:    v_cmp_eq_u32_e64 s[4:5], 0, v16
-; VI-NEXT:    v_cmp_eq_u32_e64 s[18:19], 0, v17
-; VI-NEXT:    v_cmp_eq_u32_e64 s[40:41], 0, v29
-; VI-NEXT:    v_lshrrev_b32_e32 v16, 16, v6
-; VI-NEXT:    v_lshrrev_b32_e32 v17, 16, v14
-; VI-NEXT:    v_cmp_eq_u32_e64 s[6:7], 0, v18
-; VI-NEXT:    v_cmp_eq_u32_e64 s[28:29], 0, v27
-; VI-NEXT:    v_cndmask_b32_e64 v16, v17, v16, s[40:41]
-; VI-NEXT:    v_lshrrev_b32_e32 v17, 16, v5
-; VI-NEXT:    v_lshrrev_b32_e32 v18, 16, v13
-; VI-NEXT:    v_cmp_eq_u32_e64 s[20:21], 0, v19
-; VI-NEXT:    v_cmp_eq_u32_e64 s[26:27], 0, v25
-; VI-NEXT:    v_cndmask_b32_e64 v17, v18, v17, s[28:29]
-; VI-NEXT:    v_lshrrev_b32_e32 v18, 16, v4
-; VI-NEXT:    v_lshrrev_b32_e32 v19, 16, v12
-; VI-NEXT:    v_cmp_eq_u32_e64 s[8:9], 0, v20
-; VI-NEXT:    v_cmp_eq_u32_e64 s[24:25], 0, v23
-; VI-NEXT:    v_cndmask_b32_e64 v18, v19, v18, s[26:27]
-; VI-NEXT:    v_lshrrev_b32_e32 v19, 16, v3
-; VI-NEXT:    v_lshrrev_b32_e32 v20, 16, v11
-; VI-NEXT:    v_cmp_eq_u32_e64 s[22:23], 0, v21
-; VI-NEXT:    v_cndmask_b32_e64 v19, v20, v19, s[24:25]
-; VI-NEXT:    v_lshrrev_b32_e32 v20, 16, v2
-; VI-NEXT:    v_lshrrev_b32_e32 v21, 16, v10
-; VI-NEXT:    v_cmp_eq_u32_e64 s[10:11], 0, v22
-; VI-NEXT:    v_cndmask_b32_e64 v20, v21, v20, s[22:23]
-; VI-NEXT:    v_lshrrev_b32_e32 v21, 16, v1
-; VI-NEXT:    v_lshrrev_b32_e32 v22, 16, v9
-; VI-NEXT:    v_cndmask_b32_e64 v21, v22, v21, s[20:21]
-; VI-NEXT:    v_lshrrev_b32_e32 v22, 16, v0
-; VI-NEXT:    v_lshrrev_b32_e32 v23, 16, v8
-; VI-NEXT:    v_cndmask_b32_e64 v0, v8, v0, s[4:5]
-; VI-NEXT:    buffer_load_dword v8, off, s[0:3], s32
-; VI-NEXT:    v_cndmask_b32_e64 v22, v23, v22, s[18:19]
-; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v30
-; VI-NEXT:    v_cndmask_b32_e64 v1, v9, v1, s[6:7]
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v22
-; VI-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s[8:9]
-; VI-NEXT:    v_or_b32_sdwa v0, v0, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_lshrrev_b32_e32 v9, 16, v7
-; VI-NEXT:    v_cndmask_b32_e32 v7, v15, v7, vcc
-; VI-NEXT:    v_lshrrev_b32_e32 v10, 16, v15
-; VI-NEXT:    v_cmp_eq_u32_e64 s[12:13], 0, v24
-; VI-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s[10:11]
-; VI-NEXT:    v_cmp_eq_u32_e64 s[14:15], 0, v26
-; VI-NEXT:    v_cndmask_b32_e64 v4, v12, v4, s[12:13]
-; VI-NEXT:    v_cmp_eq_u32_e64 s[16:17], 0, v28
-; VI-NEXT:    v_cndmask_b32_e64 v5, v13, v5, s[14:15]
-; VI-NEXT:    v_cndmask_b32_e64 v6, v14, v6, s[16:17]
-; VI-NEXT:    s_waitcnt vmcnt(0)
-; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v8
-; VI-NEXT:    v_cndmask_b32_e32 v8, v10, v9, vcc
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v21
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v16
+; VI-NEXT:    buffer_load_dword v16, off, s[0:3], s32
+; VI-NEXT:    v_cmp_eq_u32_e64 s[8:9], 0, v22
+; VI-NEXT:    v_cmp_eq_u32_e64 s[10:11], 0, v24
+; VI-NEXT:    v_lshrrev_b32_e32 v22, 16, v6
+; VI-NEXT:    v_lshrrev_b32_e32 v24, 16, v14
+; VI-NEXT:    v_cmp_eq_u32_e64 s[20:21], 0, v29
+; VI-NEXT:    v_cmp_eq_u32_e64 s[12:13], 0, v26
+; VI-NEXT:    v_cmp_eq_u32_e64 s[14:15], 0, v28
+; VI-NEXT:    v_cmp_eq_u32_e64 s[18:19], 0, v27
+; VI-NEXT:    v_lshrrev_b32_e32 v26, 16, v4
+; VI-NEXT:    v_lshrrev_b32_e32 v27, 16, v12
+; VI-NEXT:    v_cndmask_b32_e64 v22, v24, v22, s[20:21]
+; VI-NEXT:    v_lshrrev_b32_e32 v24, 16, v0
+; VI-NEXT:    v_cndmask_b32_e32 v0, v8, v0, vcc
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v25
+; VI-NEXT:    v_cmp_eq_u32_e64 s[4:5], 0, v18
+; VI-NEXT:    v_cmp_eq_u32_e64 s[6:7], 0, v20
+; VI-NEXT:    v_lshrrev_b32_e32 v18, 16, v5
+; VI-NEXT:    v_lshrrev_b32_e32 v20, 16, v13
+; VI-NEXT:    v_cndmask_b32_e64 v6, v14, v6, s[14:15]
+; VI-NEXT:    v_lshrrev_b32_e32 v14, 16, v3
+; VI-NEXT:    v_cndmask_b32_e64 v5, v13, v5, s[12:13]
+; VI-NEXT:    v_lshrrev_b32_e32 v13, 16, v11
+; VI-NEXT:    v_cndmask_b32_e32 v25, v27, v26, vcc
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v23
+; VI-NEXT:    v_cndmask_b32_e64 v4, v12, v4, s[10:11]
+; VI-NEXT:    v_lshrrev_b32_e32 v12, 16, v2
+; VI-NEXT:    v_cndmask_b32_e64 v3, v11, v3, s[8:9]
+; VI-NEXT:    v_lshrrev_b32_e32 v11, 16, v10
+; VI-NEXT:    v_cndmask_b32_e32 v13, v13, v14, vcc
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v21
+; VI-NEXT:    v_cndmask_b32_e64 v2, v10, v2, s[6:7]
+; VI-NEXT:    v_lshrrev_b32_e32 v10, 16, v1
+; VI-NEXT:    v_cndmask_b32_e64 v1, v9, v1, s[4:5]
+; VI-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
+; VI-NEXT:    v_cndmask_b32_e32 v11, v11, v12, vcc
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v19
+; VI-NEXT:    v_cndmask_b32_e32 v9, v9, v10, vcc
+; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v9
+; VI-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v17
 ; VI-NEXT:    v_or_b32_sdwa v1, v1, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v20
+; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v11
+; VI-NEXT:    v_cmp_eq_u32_e64 s[16:17], 0, v30
+; VI-NEXT:    v_cndmask_b32_e32 v8, v8, v24, vcc
 ; VI-NEXT:    v_or_b32_sdwa v2, v2, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v19
+; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v13
+; VI-NEXT:    v_cndmask_b32_e64 v18, v20, v18, s[18:19]
+; VI-NEXT:    v_lshrrev_b32_e32 v20, 16, v7
+; VI-NEXT:    v_cndmask_b32_e64 v7, v15, v7, s[16:17]
+; VI-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
+; VI-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; VI-NEXT:    v_or_b32_sdwa v3, v3, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v18
+; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v25
+; VI-NEXT:    v_or_b32_sdwa v0, v0, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_or_b32_sdwa v4, v4, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v17
+; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v18
 ; VI-NEXT:    v_or_b32_sdwa v5, v5, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v16
-; VI-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
+; VI-NEXT:    v_lshlrev_b32_e32 v9, 16, v22
 ; VI-NEXT:    v_or_b32_sdwa v6, v6, v9 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; VI-NEXT:    s_waitcnt vmcnt(0)
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v16
+; VI-NEXT:    v_cndmask_b32_e32 v8, v15, v20, vcc
+; VI-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
 ; VI-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -3232,11 +3228,11 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; VI-NEXT:    v_lshrrev_b32_e32 v55, 16, v2
 ; VI-NEXT:    v_lshrrev_b32_e32 v43, 16, v18
 ; VI-NEXT:    buffer_load_dword v44, off, s[0:3], s32 offset:108
-; VI-NEXT:    s_waitcnt vmcnt(5)
+; VI-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:92
+; VI-NEXT:    s_waitcnt vmcnt(6)
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v49
 ; VI-NEXT:    v_cndmask_b32_e32 v49, v43, v55, vcc
 ; VI-NEXT:    buffer_load_dword v55, off, s[0:3], s32 offset:100
-; VI-NEXT:    buffer_load_dword v45, off, s[0:3], s32 offset:92
 ; VI-NEXT:    v_lshrrev_b32_e32 v43, 16, v1
 ; VI-NEXT:    v_lshrrev_b32_e32 v46, 16, v17
 ; VI-NEXT:    buffer_load_dword v47, off, s[0:3], s32 offset:84
@@ -3267,10 +3263,9 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; VI-NEXT:    s_waitcnt vmcnt(13)
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v44
 ; VI-NEXT:    v_cndmask_b32_e32 v13, v29, v13, vcc
-; VI-NEXT:    s_waitcnt vmcnt(12)
+; VI-NEXT:    s_waitcnt vmcnt(11)
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v55
 ; VI-NEXT:    v_cndmask_b32_e32 v12, v28, v12, vcc
-; VI-NEXT:    s_waitcnt vmcnt(11)
 ; VI-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v45
 ; VI-NEXT:    v_cndmask_b32_e32 v11, v27, v11, vcc
 ; VI-NEXT:    s_waitcnt vmcnt(10)
@@ -3494,8 +3489,8 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16:       ; %bb.0:
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-FAKE16-NEXT:    s_clause 0x1f
-; GFX11-FAKE16-NEXT:    scratch_load_b32 v31, off, s32 offset:120
-; GFX11-FAKE16-NEXT:    scratch_load_b32 v32, off, s32 offset:112
+; GFX11-FAKE16-NEXT:    scratch_load_b32 v31, off, s32 offset:112
+; GFX11-FAKE16-NEXT:    scratch_load_b32 v32, off, s32 offset:120
 ; GFX11-FAKE16-NEXT:    scratch_load_b32 v33, off, s32
 ; GFX11-FAKE16-NEXT:    scratch_load_b32 v34, off, s32 offset:104
 ; GFX11-FAKE16-NEXT:    scratch_load_b32 v35, off, s32 offset:96
@@ -3527,8 +3522,6 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    scratch_load_b32 v85, off, s32 offset:4
 ; GFX11-FAKE16-NEXT:    scratch_load_b32 v86, off, s32 offset:20
 ; GFX11-FAKE16-NEXT:    scratch_load_b32 v87, off, s32 offset:128
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v97, 16, v14
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v98, 16, v30
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v99, 16, v13
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v100, 16, v29
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v101, 16, v12
@@ -3553,19 +3546,21 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v144, 16, v19
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v145, 16, v2
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v146, 16, v18
+; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v97, 16, v14
+; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v98, 16, v30
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v147, 16, v1
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v96, 16, v15
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(32)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v31
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v31, 16, v17
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v97, v98, v97, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(31)
-; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v32
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v98, 16, v0
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v32, 16, v16
+; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e64 s0, 0, v32
+; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v32, 16, v0
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v99, v100, v99, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(29)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v34
+; GFX11-FAKE16-NEXT:    v_cndmask_b32_e64 v97, v98, v97, s0
+; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v98, 16, v16
 ; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v100, 16, v33
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v34, v102, v101, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(28)
@@ -3603,7 +3598,7 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v31, v31, v147, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(17)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v54
-; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v32, v32, v98, vcc_lo
+; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v32, v98, v32, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(16)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v55
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v15, v33, v15, vcc_lo
@@ -3620,7 +3615,8 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v12, v28, v12, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(12)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v67
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-FAKE16-NEXT:    v_perm_b32 v13, v99, v13, 0x5040100
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v12, v34, v12, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v11, v27, v11, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(11)
@@ -3628,7 +3624,7 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v10, v26, v10, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(10)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v69
-; GFX11-FAKE16-NEXT:    v_perm_b32 v13, v99, v13, 0x5040100
+; GFX11-FAKE16-NEXT:    v_perm_b32 v11, v35, v11, 0x5040100
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v10, v36, v10, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v9, v25, v9, vcc_lo
@@ -3637,7 +3633,7 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v8, v24, v8, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(8)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v71
-; GFX11-FAKE16-NEXT:    v_perm_b32 v11, v35, v11, 0x5040100
+; GFX11-FAKE16-NEXT:    v_perm_b32 v9, v37, v9, 0x5040100
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v8, v38, v8, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v7, v23, v7, vcc_lo
@@ -3646,7 +3642,7 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v6, v22, v6, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(6)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v81
-; GFX11-FAKE16-NEXT:    v_perm_b32 v9, v37, v9, 0x5040100
+; GFX11-FAKE16-NEXT:    v_perm_b32 v7, v39, v7, 0x5040100
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v6, v48, v6, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v5, v21, v5, vcc_lo
@@ -3655,7 +3651,7 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v4, v20, v4, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(4)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v83
-; GFX11-FAKE16-NEXT:    v_perm_b32 v7, v39, v7, 0x5040100
+; GFX11-FAKE16-NEXT:    v_perm_b32 v5, v49, v5, 0x5040100
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v4, v50, v4, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v3, v19, v3, vcc_lo
@@ -3667,17 +3663,16 @@ define <32 x half> @v_vselect_v32f16(<32 x half> %a, <32 x half> %b, <32 x i32>
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v0, v16, v0, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(1)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v86
-; GFX11-FAKE16-NEXT:    v_perm_b32 v5, v49, v5, 0x5040100
+; GFX11-FAKE16-NEXT:    v_perm_b32 v3, v51, v3, 0x5040100
 ; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(VALU_DEP_3)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v0, v32, v0, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v2, v18, v2, vcc_lo
 ; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-FAKE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 0, v87
-; GFX11-FAKE16-NEXT:    v_perm_b32 v3, v51, v3, 0x5040100
+; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v31, v1, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v2, v52, v2, 0x5040100
 ; GFX11-FAKE16-NEXT:    v_cndmask_b32_e32 v16, v100, v96, vcc_lo
-; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v31, v1, 0x5040100
-; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-FAKE16-NEXT:    v_perm_b32 v15, v16, v15, 0x5040100
 ; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
   %cmp = icmp eq <32 x i32> %cond, zeroinitializer
diff --git a/llvm/test/CodeGen/AMDGPU/shl.ll b/llvm/test/CodeGen/AMDGPU/shl.ll
index 593cff712004a..a82a6a8a4c367 100644
--- a/llvm/test/CodeGen/AMDGPU/shl.ll
+++ b/llvm/test/CodeGen/AMDGPU/shl.ll
@@ -878,19 +878,19 @@ define amdgpu_kernel void @shl_v4i64(ptr addrspace(1) %out, ptr addrspace(1) %in
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b32 s8, s2
 ; SI-NEXT:    s_mov_b32 s9, s3
-; SI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:16
-; SI-NEXT:    buffer_load_dwordx4 v[4:7], off, s[8:11], 0 offset:48
+; SI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:48
+; SI-NEXT:    buffer_load_dwordx4 v[3:6], off, s[8:11], 0 offset:16
 ; SI-NEXT:    buffer_load_dwordx4 v[7:10], off, s[8:11], 0
 ; SI-NEXT:    buffer_load_dwordx4 v[11:14], off, s[8:11], 0 offset:32
 ; SI-NEXT:    s_mov_b32 s4, s0
 ; SI-NEXT:    s_mov_b32 s5, s1
 ; SI-NEXT:    s_waitcnt vmcnt(2)
-; SI-NEXT:    v_lshl_b64 v[2:3], v[2:3], v6
-; SI-NEXT:    v_lshl_b64 v[0:1], v[0:1], v4
+; SI-NEXT:    v_lshl_b64 v[5:6], v[5:6], v2
+; SI-NEXT:    v_lshl_b64 v[3:4], v[3:4], v0
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    v_lshl_b64 v[9:10], v[9:10], v13
 ; SI-NEXT:    v_lshl_b64 v[7:8], v[7:8], v11
-; SI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0 offset:16
+; SI-NEXT:    buffer_store_dwordx4 v[3:6], off, s[4:7], 0 offset:16
 ; SI-NEXT:    buffer_store_dwordx4 v[7:10], off, s[4:7], 0
 ; SI-NEXT:    s_endpgm
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll b/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll
index b7e6ebaa655b9..5aafb0f576fb4 100644
--- a/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll
+++ b/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll
@@ -94,8 +94,9 @@ define amdgpu_gs void @_amdgpu_gs_main(i32 inreg %primShaderTableAddrLow, <31 x
   ; CHECK-NEXT:   [[S_BUFFER_LOAD_DWORD_IMM3:%[0-9]+]]:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_IMM undef %368:sgpr_128, 16, 0 :: (dereferenceable invariant load (s32))
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM4:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_3]], 64, 0 :: (invariant load (s128) from %ir.99, addrspace 4)
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM5:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_4]], 64, 0 :: (invariant load (s128) from %ir.107, addrspace 4)
-  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM6:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_6]], 0, 0 :: (invariant load (s128) from %ir.117, addrspace 4)
-  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM7:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_7]], 0, 0 :: (invariant load (s128) from %ir.124, addrspace 4)
+  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM6:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_5]], 0, 0 :: (invariant load (s128) from %ir.112, addrspace 4)
+  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM7:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_6]], 0, 0 :: (invariant load (s128) from %ir.117, addrspace 4)
+  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM8:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_7]], 0, 0 :: (invariant load (s128) from %ir.124, addrspace 4)
   ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN2:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM2]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
   ; CHECK-NEXT:   [[S_BUFFER_LOAD_DWORD_SGPR_IMM4:%[0-9]+]]:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_SGPR_IMM undef %352:sgpr_128, [[S_ADD_I32_]], 0, 0 :: (dereferenceable invariant load (s32))
   ; CHECK-NEXT:   [[S_BUFFER_LOAD_DWORD_SGPR_IMM5:%[0-9]+]]:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_SGPR_IMM undef %363:sgpr_128, [[S_ADD_I32_1]], 0, 0 :: (dereferenceable invariant load (s32))
@@ -104,7 +105,6 @@ define amdgpu_gs void @_amdgpu_gs_main(i32 inreg %primShaderTableAddrLow, <31 x
   ; CHECK-NEXT:   [[S_ADD_I32_3:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM1]], -114, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_4:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM2]], -130, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_5:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_IMM2]], -178, implicit-def dead $scc
-  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM8:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_5]], 0, 0 :: (invariant load (s128) from %ir.112, addrspace 4)
   ; CHECK-NEXT:   undef [[S_ADD_U32_12:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY10]], [[S_LSHL_B32_]], implicit-def $scc
   ; CHECK-NEXT:   [[S_ADD_U32_12:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %42:sreg_32, [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
   ; CHECK-NEXT:   undef [[S_ADD_U32_13:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY11]], [[S_LSHL_B32_]], implicit-def $scc
@@ -121,17 +121,17 @@ define amdgpu_gs void @_amdgpu_gs_main(i32 inreg %primShaderTableAddrLow, <31 x
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM9:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_5]], 224, 0 :: (invariant load (s128) from %ir.129, addrspace 4)
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM10:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY7]], 224, 0 :: (invariant load (s128) from %ir.145, addrspace 4)
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM11:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_5]], 576, 0 :: (invariant load (s128) from %ir.150, addrspace 4)
-  ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN6:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM8]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+  ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN6:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM6]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM12:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_6]], 224, 0 :: (invariant load (s128) from %ir.134, addrspace 4)
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM13:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_7]], 576, 0 :: (invariant load (s128) from %ir.162, addrspace 4)
-  ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN7:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM6]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
-  ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN8:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM7]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM14:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_8]], 224, 0 :: (invariant load (s128) from %ir.140, addrspace 4)
+  ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN7:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM7]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+  ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN8:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM8]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
   ; CHECK-NEXT:   [[S_ADD_I32_7:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM4]], -217, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_8:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM3]], -233, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_9:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM5]], -249, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_10:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_IMM3]], -297, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_11:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM3]], -313, implicit-def dead $scc
-  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM14:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_8]], 224, 0 :: (invariant load (s128) from %ir.140, addrspace 4)
   ; CHECK-NEXT:   [[S_ADD_I32_12:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM3]], -329, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_13:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM3]], -345, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ADD_I32_14:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_BUFFER_LOAD_DWORD_SGPR_IMM6]], -441, implicit-def dead $scc
diff --git a/llvm/test/CodeGen/AMDGPU/sra.ll b/llvm/test/CodeGen/AMDGPU/sra.ll
index 67c51286de216..d0e5d26161184 100644
--- a/llvm/test/CodeGen/AMDGPU/sra.ll
+++ b/llvm/test/CodeGen/AMDGPU/sra.ll
@@ -566,19 +566,19 @@ define amdgpu_kernel void @ashr_v4i64(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b32 s8, s2
 ; SI-NEXT:    s_mov_b32 s9, s3
-; SI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:16
-; SI-NEXT:    buffer_load_dwordx4 v[4:7], off, s[8:11], 0 offset:48
+; SI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:48
+; SI-NEXT:    buffer_load_dwordx4 v[3:6], off, s[8:11], 0 offset:16
 ; SI-NEXT:    buffer_load_dwordx4 v[7:10], off, s[8:11], 0
 ; SI-NEXT:    buffer_load_dwordx4 v[11:14], off, s[8:11], 0 offset:32
 ; SI-NEXT:    s_mov_b32 s4, s0
 ; SI-NEXT:    s_mov_b32 s5, s1
 ; SI-NEXT:    s_waitcnt vmcnt(2)
-; SI-NEXT:    v_ashr_i64 v[2:3], v[2:3], v6
-; SI-NEXT:    v_ashr_i64 v[0:1], v[0:1], v4
+; SI-NEXT:    v_ashr_i64 v[5:6], v[5:6], v2
+; SI-NEXT:    v_ashr_i64 v[3:4], v[3:4], v0
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    v_ashr_i64 v[9:10], v[9:10], v13
 ; SI-NEXT:    v_ashr_i64 v[7:8], v[7:8], v11
-; SI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0 offset:16
+; SI-NEXT:    buffer_store_dwordx4 v[3:6], off, s[4:7], 0 offset:16
 ; SI-NEXT:    buffer_store_dwordx4 v[7:10], off, s[4:7], 0
 ; SI-NEXT:    s_endpgm
 ;
@@ -592,19 +592,19 @@ define amdgpu_kernel void @ashr_v4i64(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_mov_b32 s8, s6
 ; VI-NEXT:    s_mov_b32 s9, s7
-; VI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:16
-; VI-NEXT:    buffer_load_dwordx4 v[4:7], off, s[8:11], 0 offset:48
+; VI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:48
+; VI-NEXT:    buffer_load_dwordx4 v[3:6], off, s[8:11], 0 offset:16
 ; VI-NEXT:    buffer_load_dwordx4 v[7:10], off, s[8:11], 0
 ; VI-NEXT:    buffer_load_dwordx4 v[11:14], off, s[8:11], 0 offset:32
 ; VI-NEXT:    s_mov_b32 s0, s4
 ; VI-NEXT:    s_mov_b32 s1, s5
 ; VI-NEXT:    s_waitcnt vmcnt(2)
-; VI-NEXT:    v_ashrrev_i64 v[2:3], v6, v[2:3]
-; VI-NEXT:    v_ashrrev_i64 v[0:1], v4, v[0:1]
+; VI-NEXT:    v_ashrrev_i64 v[5:6], v2, v[5:6]
+; VI-NEXT:    v_ashrrev_i64 v[3:4], v0, v[3:4]
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    v_ashrrev_i64 v[9:10], v13, v[9:10]
 ; VI-NEXT:    v_ashrrev_i64 v[7:8], v11, v[7:8]
-; VI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[0:3], 0 offset:16
+; VI-NEXT:    buffer_store_dwordx4 v[3:6], off, s[0:3], 0 offset:16
 ; VI-NEXT:    buffer_store_dwordx4 v[7:10], off, s[0:3], 0
 ; VI-NEXT:    s_endpgm
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/srem.ll b/llvm/test/CodeGen/AMDGPU/srem.ll
index 6423267be4b34..6da7d1b7ee868 100644
--- a/llvm/test/CodeGen/AMDGPU/srem.ll
+++ b/llvm/test/CodeGen/AMDGPU/srem.ll
@@ -6092,15 +6092,15 @@ define amdgpu_kernel void @srem_v4i64(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; TONGA-NEXT:    v_mov_b32_e32 v8, 0
 ; TONGA-NEXT:    s_waitcnt lgkmcnt(0)
 ; TONGA-NEXT:    s_add_u32 s0, s6, 48
-; TONGA-NEXT:    s_addc_u32 s1, s7, 0
-; TONGA-NEXT:    s_add_u32 s2, s6, 32
 ; TONGA-NEXT:    v_mov_b32_e32 v0, s6
-; TONGA-NEXT:    s_addc_u32 s3, s7, 0
-; TONGA-NEXT:    v_mov_b32_e32 v2, s2
+; TONGA-NEXT:    s_addc_u32 s1, s7, 0
 ; TONGA-NEXT:    v_mov_b32_e32 v1, s7
-; TONGA-NEXT:    v_mov_b32_e32 v3, s3
-; TONGA-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
+; TONGA-NEXT:    s_add_u32 s2, s6, 32
 ; TONGA-NEXT:    flat_load_dwordx4 v[14:17], v[0:1]
+; TONGA-NEXT:    s_addc_u32 s3, s7, 0
+; TONGA-NEXT:    v_mov_b32_e32 v0, s2
+; TONGA-NEXT:    v_mov_b32_e32 v1, s3
+; TONGA-NEXT:    flat_load_dwordx4 v[10:13], v[0:1]
 ; TONGA-NEXT:    v_mov_b32_e32 v0, s0
 ; TONGA-NEXT:    v_mov_b32_e32 v1, s1
 ; TONGA-NEXT:    s_add_u32 s0, s6, 16
diff --git a/llvm/test/CodeGen/AMDGPU/srl.ll b/llvm/test/CodeGen/AMDGPU/srl.ll
index badb1f6fe9847..239de43baa457 100644
--- a/llvm/test/CodeGen/AMDGPU/srl.ll
+++ b/llvm/test/CodeGen/AMDGPU/srl.ll
@@ -266,19 +266,19 @@ define amdgpu_kernel void @lshr_v4i64(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b32 s8, s2
 ; SI-NEXT:    s_mov_b32 s9, s3
-; SI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:16
-; SI-NEXT:    buffer_load_dwordx4 v[4:7], off, s[8:11], 0 offset:48
+; SI-NEXT:    buffer_load_dwordx4 v[0:3], off, s[8:11], 0 offset:48
+; SI-NEXT:    buffer_load_dwordx4 v[3:6], off, s[8:11], 0 offset:16
 ; SI-NEXT:    buffer_load_dwordx4 v[7:10], off, s[8:11], 0
 ; SI-NEXT:    buffer_load_dwordx4 v[11:14], off, s[8:11], 0 offset:32
 ; SI-NEXT:    s_mov_b32 s4, s0
 ; SI-NEXT:    s_mov_b32 s5, s1
 ; SI-NEXT:    s_waitcnt vmcnt(2)
-; SI-NEXT:    v_lshr_b64 v[2:3], v[2:3], v6
-; SI-NEXT:    v_lshr_b64 v[0:1], v[0:1], v4
+; SI-NEXT:    v_lshr_b64 v[5:6], v[5:6], v2
+; SI-NEXT:    v_lshr_b64 v[3:4], v[3:4], v0
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    v_lshr_b64 v[9:10], v[9:10], v13
 ; SI-NEXT:    v_lshr_b64 v[7:8], v[7:8], v11
-; SI-NEXT:    buffer_store_dwordx4 v[0:3], off, s[4:7], 0 offset:16
+; SI-NEXT:    buffer_store_dwordx4 v[3:6], off, s[4:7], 0 offset:16
 ; SI-NEXT:    buffer_store_dwordx4 v[7:10], off, s[4:7], 0
 ; SI-NEXT:    s_endpgm
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/store-local.128.ll b/llvm/test/CodeGen/AMDGPU/store-local.128.ll
index 76ed4f6238dbe..2efa022efd70f 100644
--- a/llvm/test/CodeGen/AMDGPU/store-local.128.ll
+++ b/llvm/test/CodeGen/AMDGPU/store-local.128.ll
@@ -279,36 +279,37 @@ define amdgpu_kernel void @store_lds_v4i32_align1(ptr addrspace(3) %out, <4 x i3
 ; GFX11-NEXT:    s_load_b128 s[0:3], s[4:5], 0x10
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s3
+; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s1
 ; GFX11-NEXT:    s_lshr_b32 s4, s3, 8
-; GFX11-NEXT:    s_lshr_b32 s3, s3, 24
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s0 :: v_dual_mov_b32 v5, s4
 ; GFX11-NEXT:    s_lshr_b32 s5, s2, 8
-; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s1
 ; GFX11-NEXT:    s_lshr_b32 s2, s2, 24
 ; GFX11-NEXT:    s_lshr_b32 s6, s1, 8
-; GFX11-NEXT:    v_dual_mov_b32 v6, s3 :: v_dual_mov_b32 v7, s5
-; GFX11-NEXT:    v_dual_mov_b32 v8, s2 :: v_dual_mov_b32 v9, s6
-; GFX11-NEXT:    v_dual_mov_b32 v4, s0 :: v_dual_mov_b32 v5, s4
+; GFX11-NEXT:    s_lshr_b32 s3, s3, 24
 ; GFX11-NEXT:    s_lshr_b32 s1, s1, 24
 ; GFX11-NEXT:    s_lshr_b32 s7, s0, 8
 ; GFX11-NEXT:    s_lshr_b32 s0, s0, 24
+; GFX11-NEXT:    v_dual_mov_b32 v8, s2 :: v_dual_mov_b32 v9, s6
+; GFX11-NEXT:    v_dual_mov_b32 v6, s3 :: v_dual_mov_b32 v7, s5
+; GFX11-NEXT:    v_mov_b32_e32 v10, s1
 ; GFX11-NEXT:    ds_store_b8 v0, v2 offset:8
 ; GFX11-NEXT:    ds_store_b8_d16_hi v0, v2 offset:10
 ; GFX11-NEXT:    ds_store_b8 v0, v1 offset:12
-; GFX11-NEXT:    ds_store_b8 v0, v4
-; GFX11-NEXT:    ds_store_b8_d16_hi v0, v4 offset:2
-; GFX11-NEXT:    ds_store_b8 v0, v3 offset:4
-; GFX11-NEXT:    ds_store_b8 v0, v5 offset:13
 ; GFX11-NEXT:    ds_store_b8_d16_hi v0, v1 offset:14
+; GFX11-NEXT:    ds_store_b8 v0, v5 offset:13
 ; GFX11-NEXT:    ds_store_b8 v0, v6 offset:15
-; GFX11-NEXT:    v_dual_mov_b32 v1, s1 :: v_dual_mov_b32 v10, s7
-; GFX11-NEXT:    v_mov_b32_e32 v11, s0
 ; GFX11-NEXT:    ds_store_b8 v0, v7 offset:9
 ; GFX11-NEXT:    ds_store_b8 v0, v8 offset:11
-; GFX11-NEXT:    ds_store_b8 v0, v9 offset:5
+; GFX11-NEXT:    v_dual_mov_b32 v1, s7 :: v_dual_mov_b32 v2, s0
+; GFX11-NEXT:    ds_store_b8 v0, v4
+; GFX11-NEXT:    ds_store_b8_d16_hi v0, v4 offset:2
+; GFX11-NEXT:    ds_store_b8 v0, v3 offset:4
 ; GFX11-NEXT:    ds_store_b8_d16_hi v0, v3 offset:6
-; GFX11-NEXT:    ds_store_b8 v0, v1 offset:7
-; GFX11-NEXT:    ds_store_b8 v0, v10 offset:1
-; GFX11-NEXT:    ds_store_b8 v0, v11 offset:3
+; GFX11-NEXT:    ds_store_b8 v0, v9 offset:5
+; GFX11-NEXT:    ds_store_b8 v0, v10 offset:7
+; GFX11-NEXT:    ds_store_b8 v0, v1 offset:1
+; GFX11-NEXT:    ds_store_b8 v0, v2 offset:3
 ; GFX11-NEXT:    s_endpgm
   store <4 x i32> %x, ptr addrspace(3) %out, align 1
   ret void
@@ -420,17 +421,17 @@ define amdgpu_kernel void @store_lds_v4i32_align2(ptr addrspace(3) %out, <4 x i3
 ; GFX11-NEXT:    s_load_b32 s6, s[4:5], 0x0
 ; GFX11-NEXT:    s_load_b128 s[0:3], s[4:5], 0x10
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s3
-; GFX11-NEXT:    v_dual_mov_b32 v2, s0 :: v_dual_mov_b32 v3, s1
-; GFX11-NEXT:    v_mov_b32_e32 v4, s2
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v1 offset:14
-; GFX11-NEXT:    ds_store_b16 v0, v2
-; GFX11-NEXT:    ds_store_b16 v0, v3 offset:4
-; GFX11-NEXT:    ds_store_b16 v0, v4 offset:8
-; GFX11-NEXT:    ds_store_b16 v0, v1 offset:12
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v4 offset:10
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v3 offset:6
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v2 offset:2
+; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s0
+; GFX11-NEXT:    v_dual_mov_b32 v2, s1 :: v_dual_mov_b32 v3, s2
+; GFX11-NEXT:    v_mov_b32_e32 v4, s3
+; GFX11-NEXT:    ds_store_b16 v0, v1
+; GFX11-NEXT:    ds_store_b16 v0, v2 offset:4
+; GFX11-NEXT:    ds_store_b16 v0, v3 offset:8
+; GFX11-NEXT:    ds_store_b16 v0, v4 offset:12
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v4 offset:14
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v3 offset:10
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v2 offset:6
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v1 offset:2
 ; GFX11-NEXT:    s_endpgm
   store <4 x i32> %x, ptr addrspace(3) %out, align 2
   ret void
@@ -576,8 +577,8 @@ define amdgpu_kernel void @store_lds_v4i32_align8(ptr addrspace(3) %out, <4 x i3
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    v_mov_b32_e32 v4, s6
 ; GFX11-NEXT:    v_mov_b32_e32 v0, s0
-; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
-; GFX11-NEXT:    v_mov_b32_e32 v1, s1
+; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT:    v_mov_b32_e32 v3, s3
 ; GFX11-NEXT:    ds_store_2addr_b64 v4, v[0:1], v[2:3] offset1:1
 ; GFX11-NEXT:    s_endpgm
   store <4 x i32> %x, ptr addrspace(3) %out, align 8
diff --git a/llvm/test/CodeGen/AMDGPU/store-local.96.ll b/llvm/test/CodeGen/AMDGPU/store-local.96.ll
index 70906d8474aa5..03a7ec4883ff8 100644
--- a/llvm/test/CodeGen/AMDGPU/store-local.96.ll
+++ b/llvm/test/CodeGen/AMDGPU/store-local.96.ll
@@ -239,25 +239,26 @@ define amdgpu_kernel void @store_lds_v3i32_align1(ptr addrspace(3) %out, <3 x i3
 ; GFX11-NEXT:    s_load_b128 s[0:3], s[4:5], 0x10
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s2
-; GFX11-NEXT:    v_dual_mov_b32 v2, s1 :: v_dual_mov_b32 v3, s0
 ; GFX11-NEXT:    s_lshr_b32 s3, s2, 8
 ; GFX11-NEXT:    s_lshr_b32 s2, s2, 24
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    v_dual_mov_b32 v4, s3 :: v_dual_mov_b32 v5, s2
+; GFX11-NEXT:    v_dual_mov_b32 v2, s1 :: v_dual_mov_b32 v3, s0
 ; GFX11-NEXT:    s_lshr_b32 s4, s1, 8
 ; GFX11-NEXT:    s_lshr_b32 s1, s1, 24
 ; GFX11-NEXT:    s_lshr_b32 s5, s0, 8
 ; GFX11-NEXT:    s_lshr_b32 s0, s0, 24
-; GFX11-NEXT:    v_dual_mov_b32 v4, s3 :: v_dual_mov_b32 v5, s2
 ; GFX11-NEXT:    v_dual_mov_b32 v6, s4 :: v_dual_mov_b32 v7, s1
 ; GFX11-NEXT:    v_dual_mov_b32 v8, s5 :: v_dual_mov_b32 v9, s0
 ; GFX11-NEXT:    ds_store_b8 v0, v1 offset:8
+; GFX11-NEXT:    ds_store_b8_d16_hi v0, v1 offset:10
+; GFX11-NEXT:    ds_store_b8 v0, v4 offset:9
+; GFX11-NEXT:    ds_store_b8 v0, v5 offset:11
 ; GFX11-NEXT:    ds_store_b8 v0, v3
 ; GFX11-NEXT:    ds_store_b8_d16_hi v0, v3 offset:2
 ; GFX11-NEXT:    ds_store_b8 v0, v2 offset:4
-; GFX11-NEXT:    ds_store_b8 v0, v4 offset:9
-; GFX11-NEXT:    ds_store_b8_d16_hi v0, v1 offset:10
-; GFX11-NEXT:    ds_store_b8 v0, v5 offset:11
-; GFX11-NEXT:    ds_store_b8 v0, v6 offset:5
 ; GFX11-NEXT:    ds_store_b8_d16_hi v0, v2 offset:6
+; GFX11-NEXT:    ds_store_b8 v0, v6 offset:5
 ; GFX11-NEXT:    ds_store_b8 v0, v7 offset:7
 ; GFX11-NEXT:    ds_store_b8 v0, v8 offset:1
 ; GFX11-NEXT:    ds_store_b8 v0, v9 offset:3
@@ -356,14 +357,14 @@ define amdgpu_kernel void @store_lds_v3i32_align2(ptr addrspace(3) %out, <3 x i3
 ; GFX11-NEXT:    s_load_b32 s6, s[4:5], 0x0
 ; GFX11-NEXT:    s_load_b128 s[0:3], s[4:5], 0x10
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s2
-; GFX11-NEXT:    v_dual_mov_b32 v2, s0 :: v_dual_mov_b32 v3, s1
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v1 offset:10
-; GFX11-NEXT:    ds_store_b16 v0, v2
-; GFX11-NEXT:    ds_store_b16 v0, v3 offset:4
-; GFX11-NEXT:    ds_store_b16 v0, v1 offset:8
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v3 offset:6
-; GFX11-NEXT:    ds_store_b16_d16_hi v0, v2 offset:2
+; GFX11-NEXT:    v_dual_mov_b32 v0, s6 :: v_dual_mov_b32 v1, s0
+; GFX11-NEXT:    v_dual_mov_b32 v2, s1 :: v_dual_mov_b32 v3, s2
+; GFX11-NEXT:    ds_store_b16 v0, v1
+; GFX11-NEXT:    ds_store_b16 v0, v2 offset:4
+; GFX11-NEXT:    ds_store_b16 v0, v3 offset:8
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v3 offset:10
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v2 offset:6
+; GFX11-NEXT:    ds_store_b16_d16_hi v0, v1 offset:2
 ; GFX11-NEXT:    s_endpgm
   store <3 x i32> %x, ptr addrspace(3) %out, align 2
   ret void
diff --git a/llvm/test/CodeGen/AMDGPU/sub.ll b/llvm/test/CodeGen/AMDGPU/sub.ll
index a3aeea8a145cd..ec065b4daa376 100644
--- a/llvm/test/CodeGen/AMDGPU/sub.ll
+++ b/llvm/test/CodeGen/AMDGPU/sub.ll
@@ -967,20 +967,20 @@ define amdgpu_kernel void @v_test_sub_v4i64(ptr addrspace(1) %out, ptr addrspace
 ;
 ; GFX9-LABEL: v_test_sub_v4i64:
 ; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX9-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX9-NEXT:    v_lshlrev_b32_e32 v16, 5, v0
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-NEXT:    global_load_dwordx4 v[0:3], v16, s[2:3]
-; GFX9-NEXT:    global_load_dwordx4 v[4:7], v16, s[6:7]
+; GFX9-NEXT:    global_load_dwordx4 v[0:3], v16, s[6:7]
+; GFX9-NEXT:    global_load_dwordx4 v[4:7], v16, s[2:3]
 ; GFX9-NEXT:    global_load_dwordx4 v[8:11], v16, s[2:3] offset:16
 ; GFX9-NEXT:    global_load_dwordx4 v[12:15], v16, s[6:7] offset:16
 ; GFX9-NEXT:    v_mov_b32_e32 v16, 0
 ; GFX9-NEXT:    s_waitcnt vmcnt(2)
-; GFX9-NEXT:    v_sub_co_u32_e32 v2, vcc, v2, v6
-; GFX9-NEXT:    v_subb_co_u32_e32 v3, vcc, v3, v7, vcc
-; GFX9-NEXT:    v_sub_co_u32_e32 v0, vcc, v0, v4
-; GFX9-NEXT:    v_subb_co_u32_e32 v1, vcc, v1, v5, vcc
+; GFX9-NEXT:    v_sub_co_u32_e32 v2, vcc, v6, v2
+; GFX9-NEXT:    v_subb_co_u32_e32 v3, vcc, v7, v3, vcc
+; GFX9-NEXT:    v_sub_co_u32_e32 v0, vcc, v4, v0
+; GFX9-NEXT:    v_subb_co_u32_e32 v1, vcc, v5, v1, vcc
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    v_sub_co_u32_e32 v6, vcc, v10, v14
 ; GFX9-NEXT:    v_subb_co_u32_e32 v7, vcc, v11, v15, vcc
@@ -993,22 +993,22 @@ define amdgpu_kernel void @v_test_sub_v4i64(ptr addrspace(1) %out, ptr addrspace
 ; GFX12-LABEL: v_test_sub_v4i64:
 ; GFX12:       ; %bb.0:
 ; GFX12-NEXT:    s_clause 0x1
+; GFX12-NEXT:    s_load_b64 s[6:7], s[4:5], 0x34
 ; GFX12-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX12-NEXT:    s_load_b64 s[4:5], s[4:5], 0x34
 ; GFX12-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX12-NEXT:    v_mov_b32_e32 v16, 0
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX12-NEXT:    v_lshlrev_b32_e32 v12, 5, v0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    s_clause 0x3
-; GFX12-NEXT:    global_load_b128 v[0:3], v12, s[2:3]
-; GFX12-NEXT:    global_load_b128 v[4:7], v12, s[4:5]
+; GFX12-NEXT:    global_load_b128 v[0:3], v12, s[6:7]
+; GFX12-NEXT:    global_load_b128 v[4:7], v12, s[2:3]
 ; GFX12-NEXT:    global_load_b128 v[8:11], v12, s[2:3] offset:16
-; GFX12-NEXT:    global_load_b128 v[12:15], v12, s[4:5] offset:16
+; GFX12-NEXT:    global_load_b128 v[12:15], v12, s[6:7] offset:16
 ; GFX12-NEXT:    s_wait_loadcnt 0x2
-; GFX12-NEXT:    v_sub_co_u32 v2, vcc_lo, v2, v6
+; GFX12-NEXT:    v_sub_co_u32 v2, vcc_lo, v6, v2
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12-NEXT:    v_sub_co_ci_u32_e64 v3, null, v3, v7, vcc_lo
+; GFX12-NEXT:    v_sub_co_ci_u32_e64 v3, null, v7, v3, vcc_lo
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-NEXT:    v_sub_co_u32 v10, vcc_lo, v10, v14
 ; GFX12-NEXT:    s_wait_alu 0xfffd
@@ -1016,9 +1016,9 @@ define amdgpu_kernel void @v_test_sub_v4i64(ptr addrspace(1) %out, ptr addrspace
 ; GFX12-NEXT:    v_sub_co_u32 v8, vcc_lo, v8, v12
 ; GFX12-NEXT:    s_wait_alu 0xfffd
 ; GFX12-NEXT:    v_sub_co_ci_u32_e64 v9, null, v9, v13, vcc_lo
-; GFX12-NEXT:    v_sub_co_u32 v0, vcc_lo, v0, v4
+; GFX12-NEXT:    v_sub_co_u32 v0, vcc_lo, v4, v0
 ; GFX12-NEXT:    s_wait_alu 0xfffd
-; GFX12-NEXT:    v_sub_co_ci_u32_e64 v1, null, v1, v5, vcc_lo
+; GFX12-NEXT:    v_sub_co_ci_u32_e64 v1, null, v5, v1, vcc_lo
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    global_store_b128 v16, v[8:11], s[0:1] offset:16
 ; GFX12-NEXT:    global_store_b128 v16, v[0:3], s[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/udivrem.ll b/llvm/test/CodeGen/AMDGPU/udivrem.ll
index a56346f3bb45b..74e536f813716 100644
--- a/llvm/test/CodeGen/AMDGPU/udivrem.ll
+++ b/llvm/test/CodeGen/AMDGPU/udivrem.ll
@@ -37,22 +37,23 @@ define amdgpu_kernel void @test_udivrem(ptr addrspace(1) %out0, [8 x i32], ptr a
 ; GFX6-LABEL: test_udivrem:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_load_dword s8, s[4:5], 0x26
-; GFX6-NEXT:    s_load_dword s9, s[4:5], 0x1d
 ; GFX6-NEXT:    s_mov_b32 s3, 0xf000
 ; GFX6-NEXT:    s_mov_b32 s2, -1
 ; GFX6-NEXT:    s_mov_b32 s6, s2
+; GFX6-NEXT:    s_mov_b32 s7, s3
 ; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX6-NEXT:    v_cvt_f32_u32_e32 v0, s8
 ; GFX6-NEXT:    s_sub_i32 s0, 0, s8
-; GFX6-NEXT:    s_mov_b32 s7, s3
 ; GFX6-NEXT:    v_rcp_iflag_f32_e32 v0, v0
 ; GFX6-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
 ; GFX6-NEXT:    v_cvt_u32_f32_e32 v0, v0
 ; GFX6-NEXT:    v_mul_lo_u32 v1, s0, v0
+; GFX6-NEXT:    s_load_dword s9, s[4:5], 0x1d
 ; GFX6-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x9
 ; GFX6-NEXT:    s_load_dwordx2 s[4:5], s[4:5], 0x13
 ; GFX6-NEXT:    v_mul_hi_u32 v1, v0, v1
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
+; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX6-NEXT:    v_mul_hi_u32 v0, s9, v0
 ; GFX6-NEXT:    v_readfirstlane_b32 s10, v0
 ; GFX6-NEXT:    s_mul_i32 s10, s10, s8
@@ -69,7 +70,6 @@ define amdgpu_kernel void @test_udivrem(ptr addrspace(1) %out0, [8 x i32], ptr a
 ; GFX6-NEXT:    s_cselect_b64 vcc, -1, 0
 ; GFX6-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
 ; GFX6-NEXT:    s_cselect_b32 s8, s10, s9
-; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX6-NEXT:    buffer_store_dword v0, off, s[0:3], 0
 ; GFX6-NEXT:    s_waitcnt expcnt(0)
 ; GFX6-NEXT:    v_mov_b32_e32 v0, s8
@@ -79,7 +79,6 @@ define amdgpu_kernel void @test_udivrem(ptr addrspace(1) %out0, [8 x i32], ptr a
 ; GFX8-LABEL: test_udivrem:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_load_dword s6, s[4:5], 0x98
-; GFX8-NEXT:    s_load_dword s7, s[4:5], 0x74
 ; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX8-NEXT:    v_cvt_f32_u32_e32 v0, s6
 ; GFX8-NEXT:    s_sub_i32 s0, 0, s6
@@ -87,6 +86,7 @@ define amdgpu_kernel void @test_udivrem(ptr addrspace(1) %out0, [8 x i32], ptr a
 ; GFX8-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
 ; GFX8-NEXT:    v_cvt_u32_f32_e32 v0, v0
 ; GFX8-NEXT:    v_mul_lo_u32 v1, s0, v0
+; GFX8-NEXT:    s_load_dword s7, s[4:5], 0x74
 ; GFX8-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
 ; GFX8-NEXT:    s_load_dwordx2 s[2:3], s[4:5], 0x4c
 ; GFX8-NEXT:    v_mul_hi_u32 v1, v0, v1
diff --git a/llvm/test/CodeGen/PowerPC/p10-fi-elim.ll b/llvm/test/CodeGen/PowerPC/p10-fi-elim.ll
index f70f95b428ff7..3f6838afd545b 100644
--- a/llvm/test/CodeGen/PowerPC/p10-fi-elim.ll
+++ b/llvm/test/CodeGen/PowerPC/p10-fi-elim.ll
@@ -44,10 +44,10 @@ define dso_local signext i32 @test_FI_elim(ptr noalias nocapture dereferenceable
 ; CHECK-NEXT:    mfvsrd r10, v3
 ; CHECK-NEXT:    std r5, 0(r3)
 ; CHECK-NEXT:    lbz r5, 2(r7)
-; CHECK-NEXT:    mr r7, r9
 ; CHECK-NEXT:    stb r11, 0(r3)
 ; CHECK-NEXT:    stb r12, 0(r3)
 ; CHECK-NEXT:    std r2, 0(r3)
+; CHECK-NEXT:    mr r7, r9
 ; CHECK-NEXT:    neg r10, r10
 ; CHECK-NEXT:    rlwinm r5, r5, 0, 27, 27
 ; CHECK-NEXT:    stb r5, 0(0)
@@ -93,10 +93,10 @@ define dso_local signext i32 @test_FI_elim(ptr noalias nocapture dereferenceable
 ; CHECK-BE-NEXT:    neg r5, r5
 ; CHECK-BE-NEXT:    std r5, 0(r3)
 ; CHECK-BE-NEXT:    lbz r5, 2(r7)
-; CHECK-BE-NEXT:    mr r7, r9
 ; CHECK-BE-NEXT:    stb r11, 0(r3)
 ; CHECK-BE-NEXT:    stb r12, 0(r3)
 ; CHECK-BE-NEXT:    std r30, 0(r3)
+; CHECK-BE-NEXT:    mr r7, r9
 ; CHECK-BE-NEXT:    neg r10, r10
 ; CHECK-BE-NEXT:    rlwinm r5, r5, 0, 27, 27
 ; CHECK-BE-NEXT:    stb r5, 0(0)
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
index bc002fee4417c..69519c00f88ea 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
@@ -1496,23 +1496,23 @@ define void @shl_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli t2, t2, 8
 ; RV32I-NEXT:    or a4, a6, a5
 ; RV32I-NEXT:    or a5, t0, a7
-; RV32I-NEXT:    lbu a6, 0(a1)
-; RV32I-NEXT:    lbu a7, 1(a1)
-; RV32I-NEXT:    or t0, t2, t1
+; RV32I-NEXT:    or a7, t2, t1
+; RV32I-NEXT:    lbu a6, 1(a1)
+; RV32I-NEXT:    lbu t0, 0(a1)
 ; RV32I-NEXT:    lbu t1, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a7, a7, 8
-; RV32I-NEXT:    or a7, a7, a6
+; RV32I-NEXT:    slli a6, a6, 8
+; RV32I-NEXT:    or t2, a6, t0
 ; RV32I-NEXT:    li a6, 64
 ; RV32I-NEXT:    slli a1, a1, 8
 ; RV32I-NEXT:    or a1, a1, t1
 ; RV32I-NEXT:    li t1, 32
 ; RV32I-NEXT:    slli a4, a4, 16
-; RV32I-NEXT:    slli t2, t0, 16
+; RV32I-NEXT:    slli a7, a7, 16
 ; RV32I-NEXT:    slli a1, a1, 16
 ; RV32I-NEXT:    or t0, a4, a3
-; RV32I-NEXT:    or a4, t2, a5
-; RV32I-NEXT:    or a5, a1, a7
+; RV32I-NEXT:    or a4, a7, a5
+; RV32I-NEXT:    or a5, a1, t2
 ; RV32I-NEXT:    slli a5, a5, 3
 ; RV32I-NEXT:    neg t3, a5
 ; RV32I-NEXT:    srl t4, t0, t3
@@ -1825,23 +1825,23 @@ define void @shl_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV32I-NEXT:    slli t2, t2, 8
 ; RV32I-NEXT:    or a4, a6, a5
 ; RV32I-NEXT:    or a5, t0, a7
-; RV32I-NEXT:    lbu a6, 0(a1)
-; RV32I-NEXT:    lbu a7, 1(a1)
-; RV32I-NEXT:    or t0, t2, t1
+; RV32I-NEXT:    or a7, t2, t1
+; RV32I-NEXT:    lbu a6, 1(a1)
+; RV32I-NEXT:    lbu t0, 0(a1)
 ; RV32I-NEXT:    lbu t1, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a7, a7, 8
-; RV32I-NEXT:    or a7, a7, a6
+; RV32I-NEXT:    slli a6, a6, 8
+; RV32I-NEXT:    or t2, a6, t0
 ; RV32I-NEXT:    li a6, 64
 ; RV32I-NEXT:    slli a1, a1, 8
 ; RV32I-NEXT:    or a1, a1, t1
 ; RV32I-NEXT:    li t1, 32
 ; RV32I-NEXT:    slli a4, a4, 16
-; RV32I-NEXT:    slli t2, t0, 16
+; RV32I-NEXT:    slli a7, a7, 16
 ; RV32I-NEXT:    slli a1, a1, 16
 ; RV32I-NEXT:    or t0, a4, a3
-; RV32I-NEXT:    or a4, t2, a5
-; RV32I-NEXT:    or a5, a1, a7
+; RV32I-NEXT:    or a4, a7, a5
+; RV32I-NEXT:    or a5, a1, t2
 ; RV32I-NEXT:    slli a5, a5, 5
 ; RV32I-NEXT:    neg t3, a5
 ; RV32I-NEXT:    srl t4, t0, t3
@@ -5784,23 +5784,23 @@ define void @shl_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli t3, t3, 8
 ; RV32I-NEXT:    or a5, a7, a5
 ; RV32I-NEXT:    or a7, t1, t0
-; RV32I-NEXT:    lbu t0, 0(a1)
+; RV32I-NEXT:    or t0, t3, t2
 ; RV32I-NEXT:    lbu t1, 1(a1)
-; RV32I-NEXT:    or t2, t3, t2
+; RV32I-NEXT:    lbu t2, 0(a1)
 ; RV32I-NEXT:    lbu t3, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
 ; RV32I-NEXT:    slli t1, t1, 8
-; RV32I-NEXT:    or t0, t1, t0
+; RV32I-NEXT:    or t1, t1, t2
 ; RV32I-NEXT:    li s9, 64
 ; RV32I-NEXT:    slli a1, a1, 8
 ; RV32I-NEXT:    or a1, a1, t3
 ; RV32I-NEXT:    li t4, 32
 ; RV32I-NEXT:    slli a5, a5, 16
-; RV32I-NEXT:    slli t2, t2, 16
+; RV32I-NEXT:    slli t0, t0, 16
 ; RV32I-NEXT:    slli a1, a1, 16
 ; RV32I-NEXT:    or t3, a5, a4
-; RV32I-NEXT:    or a5, t2, a7
-; RV32I-NEXT:    or a4, a1, t0
+; RV32I-NEXT:    or a5, t0, a7
+; RV32I-NEXT:    or a4, a1, t1
 ; RV32I-NEXT:    slli a4, a4, 3
 ; RV32I-NEXT:    neg s10, a4
 ; RV32I-NEXT:    srl t5, t3, s10
@@ -6698,23 +6698,23 @@ define void @shl_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV32I-NEXT:    slli t3, t3, 8
 ; RV32I-NEXT:    or a5, a7, a5
 ; RV32I-NEXT:    or a7, t1, t0
-; RV32I-NEXT:    lbu t0, 0(a1)
+; RV32I-NEXT:    or t0, t3, t2
 ; RV32I-NEXT:    lbu t1, 1(a1)
-; RV32I-NEXT:    or t2, t3, t2
+; RV32I-NEXT:    lbu t2, 0(a1)
 ; RV32I-NEXT:    lbu t3, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
 ; RV32I-NEXT:    slli t1, t1, 8
-; RV32I-NEXT:    or t0, t1, t0
+; RV32I-NEXT:    or t1, t1, t2
 ; RV32I-NEXT:    li s9, 64
 ; RV32I-NEXT:    slli a1, a1, 8
 ; RV32I-NEXT:    or a1, a1, t3
 ; RV32I-NEXT:    li t4, 32
 ; RV32I-NEXT:    slli a5, a5, 16
-; RV32I-NEXT:    slli t2, t2, 16
+; RV32I-NEXT:    slli t0, t0, 16
 ; RV32I-NEXT:    slli a1, a1, 16
 ; RV32I-NEXT:    or t3, a5, a4
-; RV32I-NEXT:    or a5, t2, a7
-; RV32I-NEXT:    or a4, a1, t0
+; RV32I-NEXT:    or a5, t0, a7
+; RV32I-NEXT:    or a4, a1, t1
 ; RV32I-NEXT:    slli a4, a4, 5
 ; RV32I-NEXT:    neg s10, a4
 ; RV32I-NEXT:    srl t5, t3, s10
@@ -7612,23 +7612,23 @@ define void @shl_32bytes_dwordOff(ptr %src.ptr, ptr %dwordOff.ptr, ptr %dst) nou
 ; RV32I-NEXT:    slli t3, t3, 8
 ; RV32I-NEXT:    or a5, a7, a5
 ; RV32I-NEXT:    or a7, t1, t0
-; RV32I-NEXT:    lbu t0, 0(a1)
+; RV32I-NEXT:    or t0, t3, t2
 ; RV32I-NEXT:    lbu t1, 1(a1)
-; RV32I-NEXT:    or t2, t3, t2
+; RV32I-NEXT:    lbu t2, 0(a1)
 ; RV32I-NEXT:    lbu t3, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
 ; RV32I-NEXT:    slli t1, t1, 8
-; RV32I-NEXT:    or t0, t1, t0
+; RV32I-NEXT:    or t1, t1, t2
 ; RV32I-NEXT:    li s9, 64
 ; RV32I-NEXT:    slli a1, a1, 8
 ; RV32I-NEXT:    or a1, a1, t3
 ; RV32I-NEXT:    li t4, 32
 ; RV32I-NEXT:    slli a5, a5, 16
-; RV32I-NEXT:    slli t2, t2, 16
+; RV32I-NEXT:    slli t0, t0, 16
 ; RV32I-NEXT:    slli a1, a1, 16
 ; RV32I-NEXT:    or t3, a5, a4
-; RV32I-NEXT:    or a5, t2, a7
-; RV32I-NEXT:    or a4, a1, t0
+; RV32I-NEXT:    or a5, t0, a7
+; RV32I-NEXT:    or a4, a1, t1
 ; RV32I-NEXT:    slli a4, a4, 6
 ; RV32I-NEXT:    neg s10, a4
 ; RV32I-NEXT:    srl t5, t3, s10
diff --git a/llvm/test/CodeGen/RISCV/abds-neg.ll b/llvm/test/CodeGen/RISCV/abds-neg.ll
index c9a48acb8d14a..3fb0f2c53bdf0 100644
--- a/llvm/test/CodeGen/RISCV/abds-neg.ll
+++ b/llvm/test/CodeGen/RISCV/abds-neg.ll
@@ -625,9 +625,9 @@ define i128 @abd_ext_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    lw a4, 4(a1)
 ; RV32I-NEXT:    lw a6, 8(a1)
 ; RV32I-NEXT:    lw t1, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
 ; RV32I-NEXT:    lw t0, 8(a2)
 ; RV32I-NEXT:    lw t2, 12(a2)
-; RV32I-NEXT:    lw a1, 0(a2)
 ; RV32I-NEXT:    lw a2, 4(a2)
 ; RV32I-NEXT:    sltu t3, t0, a6
 ; RV32I-NEXT:    mv t4, t3
@@ -744,9 +744,9 @@ define i128 @abd_ext_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
 ; RV32ZBB-NEXT:    lw a6, 8(a1)
 ; RV32ZBB-NEXT:    lw t1, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
 ; RV32ZBB-NEXT:    lw t0, 8(a2)
 ; RV32ZBB-NEXT:    lw t2, 12(a2)
-; RV32ZBB-NEXT:    lw a1, 0(a2)
 ; RV32ZBB-NEXT:    lw a2, 4(a2)
 ; RV32ZBB-NEXT:    sltu t3, t0, a6
 ; RV32ZBB-NEXT:    mv t4, t3
@@ -872,9 +872,9 @@ define i128 @abd_ext_i128_undef(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    lw a4, 4(a1)
 ; RV32I-NEXT:    lw a6, 8(a1)
 ; RV32I-NEXT:    lw t1, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
 ; RV32I-NEXT:    lw t0, 8(a2)
 ; RV32I-NEXT:    lw t2, 12(a2)
-; RV32I-NEXT:    lw a1, 0(a2)
 ; RV32I-NEXT:    lw a2, 4(a2)
 ; RV32I-NEXT:    sltu t3, t0, a6
 ; RV32I-NEXT:    mv t4, t3
@@ -991,9 +991,9 @@ define i128 @abd_ext_i128_undef(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
 ; RV32ZBB-NEXT:    lw a6, 8(a1)
 ; RV32ZBB-NEXT:    lw t1, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
 ; RV32ZBB-NEXT:    lw t0, 8(a2)
 ; RV32ZBB-NEXT:    lw t2, 12(a2)
-; RV32ZBB-NEXT:    lw a1, 0(a2)
 ; RV32ZBB-NEXT:    lw a2, 4(a2)
 ; RV32ZBB-NEXT:    sltu t3, t0, a6
 ; RV32ZBB-NEXT:    mv t4, t3
@@ -1385,8 +1385,8 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    lw a6, 4(a2)
 ; RV32I-NEXT:    lw a7, 8(a2)
 ; RV32I-NEXT:    lw t0, 12(a2)
-; RV32I-NEXT:    lw a5, 12(a1)
 ; RV32I-NEXT:    lw a3, 4(a1)
+; RV32I-NEXT:    lw a5, 12(a1)
 ; RV32I-NEXT:    lw a4, 8(a1)
 ; RV32I-NEXT:    beq a5, t0, .LBB17_2
 ; RV32I-NEXT:  # %bb.1:
@@ -1512,8 +1512,8 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    lw a6, 4(a2)
 ; RV32ZBB-NEXT:    lw a7, 8(a2)
 ; RV32ZBB-NEXT:    lw t0, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 12(a1)
 ; RV32ZBB-NEXT:    lw a3, 4(a1)
+; RV32ZBB-NEXT:    lw a5, 12(a1)
 ; RV32ZBB-NEXT:    lw a4, 8(a1)
 ; RV32ZBB-NEXT:    beq a5, t0, .LBB17_2
 ; RV32ZBB-NEXT:  # %bb.1:
@@ -1864,15 +1864,15 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    lw a4, 4(a2)
 ; RV32I-NEXT:    lw a5, 8(a2)
 ; RV32I-NEXT:    lw a7, 12(a2)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
 ; RV32I-NEXT:    lw a2, 0(a1)
+; RV32I-NEXT:    lw a6, 8(a1)
+; RV32I-NEXT:    lw t1, 12(a1)
 ; RV32I-NEXT:    lw a1, 4(a1)
-; RV32I-NEXT:    sltu t1, a6, a5
-; RV32I-NEXT:    mv t4, t1
-; RV32I-NEXT:    beq t0, a7, .LBB22_2
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq t1, a7, .LBB22_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    slt t4, t0, a7
+; RV32I-NEXT:    slt t4, t1, a7
 ; RV32I-NEXT:  .LBB22_2:
 ; RV32I-NEXT:    sltu t2, a2, a3
 ; RV32I-NEXT:    mv t3, t2
@@ -1880,7 +1880,7 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    sltu t3, a1, a4
 ; RV32I-NEXT:  .LBB22_4:
-; RV32I-NEXT:    xor t5, t0, a7
+; RV32I-NEXT:    xor t5, t1, a7
 ; RV32I-NEXT:    xor t6, a6, a5
 ; RV32I-NEXT:    or t5, t6, t5
 ; RV32I-NEXT:    mv t6, t3
@@ -1896,11 +1896,11 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:  .LBB22_8:
 ; RV32I-NEXT:    bnez t6, .LBB22_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t1, a5, a6
-; RV32I-NEXT:    sub a7, a7, t0
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
 ; RV32I-NEXT:    sub a5, a5, a6
 ; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a6, a7, t1
+; RV32I-NEXT:    sub a6, a7, t0
 ; RV32I-NEXT:    sltu a7, a5, t5
 ; RV32I-NEXT:    sub a1, a5, t5
 ; RV32I-NEXT:    sub a5, a4, t4
@@ -1908,10 +1908,10 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    sub a2, a3, a2
 ; RV32I-NEXT:    j .LBB22_11
 ; RV32I-NEXT:  .LBB22_10:
-; RV32I-NEXT:    sub a7, t0, a7
+; RV32I-NEXT:    sub a7, t1, a7
 ; RV32I-NEXT:    sub a5, a6, a5
 ; RV32I-NEXT:    sub a4, a1, a4
-; RV32I-NEXT:    sub a6, a7, t1
+; RV32I-NEXT:    sub a6, a7, t0
 ; RV32I-NEXT:    sltu a7, a5, t3
 ; RV32I-NEXT:    sub a1, a5, t3
 ; RV32I-NEXT:    sub a5, a4, t2
@@ -1951,15 +1951,15 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    lw a4, 4(a2)
 ; RV32ZBB-NEXT:    lw a5, 8(a2)
 ; RV32ZBB-NEXT:    lw a7, 12(a2)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
 ; RV32ZBB-NEXT:    lw a2, 0(a1)
+; RV32ZBB-NEXT:    lw a6, 8(a1)
+; RV32ZBB-NEXT:    lw t1, 12(a1)
 ; RV32ZBB-NEXT:    lw a1, 4(a1)
-; RV32ZBB-NEXT:    sltu t1, a6, a5
-; RV32ZBB-NEXT:    mv t4, t1
-; RV32ZBB-NEXT:    beq t0, a7, .LBB22_2
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq t1, a7, .LBB22_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    slt t4, t0, a7
+; RV32ZBB-NEXT:    slt t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB22_2:
 ; RV32ZBB-NEXT:    sltu t2, a2, a3
 ; RV32ZBB-NEXT:    mv t3, t2
@@ -1967,7 +1967,7 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    sltu t3, a1, a4
 ; RV32ZBB-NEXT:  .LBB22_4:
-; RV32ZBB-NEXT:    xor t5, t0, a7
+; RV32ZBB-NEXT:    xor t5, t1, a7
 ; RV32ZBB-NEXT:    xor t6, a6, a5
 ; RV32ZBB-NEXT:    or t5, t6, t5
 ; RV32ZBB-NEXT:    mv t6, t3
@@ -1983,11 +1983,11 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:  .LBB22_8:
 ; RV32ZBB-NEXT:    bnez t6, .LBB22_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t1, a5, a6
-; RV32ZBB-NEXT:    sub a7, a7, t0
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
 ; RV32ZBB-NEXT:    sub a5, a5, a6
 ; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a6, a7, t1
+; RV32ZBB-NEXT:    sub a6, a7, t0
 ; RV32ZBB-NEXT:    sltu a7, a5, t5
 ; RV32ZBB-NEXT:    sub a1, a5, t5
 ; RV32ZBB-NEXT:    sub a5, a4, t4
@@ -1995,10 +1995,10 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    sub a2, a3, a2
 ; RV32ZBB-NEXT:    j .LBB22_11
 ; RV32ZBB-NEXT:  .LBB22_10:
-; RV32ZBB-NEXT:    sub a7, t0, a7
+; RV32ZBB-NEXT:    sub a7, t1, a7
 ; RV32ZBB-NEXT:    sub a5, a6, a5
 ; RV32ZBB-NEXT:    sub a4, a1, a4
-; RV32ZBB-NEXT:    sub a6, a7, t1
+; RV32ZBB-NEXT:    sub a6, a7, t0
 ; RV32ZBB-NEXT:    sltu a7, a5, t3
 ; RV32ZBB-NEXT:    sub a1, a5, t3
 ; RV32ZBB-NEXT:    sub a5, a4, t2
diff --git a/llvm/test/CodeGen/RISCV/abds.ll b/llvm/test/CodeGen/RISCV/abds.ll
index 56e6dacff9748..efb4e1a6f15d6 100644
--- a/llvm/test/CodeGen/RISCV/abds.ll
+++ b/llvm/test/CodeGen/RISCV/abds.ll
@@ -536,73 +536,73 @@ define i128 @abd_ext_i128(i128 %a, i128 %b) nounwind {
 ; RV32I:       # %bb.0:
 ; RV32I-NEXT:    lw a3, 0(a1)
 ; RV32I-NEXT:    lw a4, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
-; RV32I-NEXT:    lw a7, 8(a2)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a7, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
+; RV32I-NEXT:    lw a6, 8(a2)
 ; RV32I-NEXT:    lw t1, 12(a2)
-; RV32I-NEXT:    lw a5, 0(a2)
-; RV32I-NEXT:    lw a1, 4(a2)
-; RV32I-NEXT:    sltu a2, a7, a6
-; RV32I-NEXT:    mv t4, a2
-; RV32I-NEXT:    beq t0, t1, .LBB11_2
+; RV32I-NEXT:    lw a2, 4(a2)
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq a7, t1, .LBB11_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    slt t4, t1, t0
+; RV32I-NEXT:    slt t4, t1, a7
 ; RV32I-NEXT:  .LBB11_2:
-; RV32I-NEXT:    sltu t2, a5, a3
-; RV32I-NEXT:    sltu t5, a1, a4
+; RV32I-NEXT:    sltu t2, a1, a3
+; RV32I-NEXT:    sltu t5, a2, a4
 ; RV32I-NEXT:    mv t3, t2
-; RV32I-NEXT:    beq a4, a1, .LBB11_4
+; RV32I-NEXT:    beq a4, a2, .LBB11_4
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    mv t3, t5
 ; RV32I-NEXT:  .LBB11_4:
 ; RV32I-NEXT:    addi sp, sp, -16
 ; RV32I-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    xor t6, t0, t1
-; RV32I-NEXT:    xor s0, a6, a7
+; RV32I-NEXT:    xor t6, a7, t1
+; RV32I-NEXT:    xor s0, a5, a6
 ; RV32I-NEXT:    or t6, s0, t6
 ; RV32I-NEXT:    beqz t6, .LBB11_6
 ; RV32I-NEXT:  # %bb.5:
 ; RV32I-NEXT:    mv t3, t4
 ; RV32I-NEXT:  .LBB11_6:
 ; RV32I-NEXT:    mv t4, t2
-; RV32I-NEXT:    beq a1, a4, .LBB11_8
+; RV32I-NEXT:    beq a2, a4, .LBB11_8
 ; RV32I-NEXT:  # %bb.7:
 ; RV32I-NEXT:    mv t4, t5
 ; RV32I-NEXT:  .LBB11_8:
-; RV32I-NEXT:    sltu t5, a3, a5
+; RV32I-NEXT:    sltu t5, a3, a1
 ; RV32I-NEXT:    mv t6, t5
-; RV32I-NEXT:    beq a4, a1, .LBB11_10
+; RV32I-NEXT:    beq a4, a2, .LBB11_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t6, a4, a1
+; RV32I-NEXT:    sltu t6, a4, a2
 ; RV32I-NEXT:  .LBB11_10:
 ; RV32I-NEXT:    bnez t3, .LBB11_12
 ; RV32I-NEXT:  # %bb.11:
-; RV32I-NEXT:    sub t0, t1, t0
-; RV32I-NEXT:    sub a6, a7, a6
-; RV32I-NEXT:    sub a3, a5, a3
-; RV32I-NEXT:    sub a1, a1, a4
-; RV32I-NEXT:    sub a4, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t4
-; RV32I-NEXT:    sub a2, a1, t2
-; RV32I-NEXT:    sub a1, a4, a5
-; RV32I-NEXT:    sub a4, a6, t4
+; RV32I-NEXT:    sub a7, t1, a7
+; RV32I-NEXT:    sub a5, a6, a5
+; RV32I-NEXT:    sub a1, a1, a3
+; RV32I-NEXT:    sub a2, a2, a4
+; RV32I-NEXT:    sub a4, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t4
+; RV32I-NEXT:    sub a3, a2, t2
+; RV32I-NEXT:    sub a2, a4, a6
+; RV32I-NEXT:    sub a4, a5, t4
 ; RV32I-NEXT:    j .LBB11_13
 ; RV32I-NEXT:  .LBB11_12:
-; RV32I-NEXT:    sltu a2, a6, a7
-; RV32I-NEXT:    sub t0, t0, t1
-; RV32I-NEXT:    sub a6, a6, a7
-; RV32I-NEXT:    sub a3, a3, a5
-; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a1, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t6
-; RV32I-NEXT:    sub a2, a4, t5
-; RV32I-NEXT:    sub a1, a1, a5
-; RV32I-NEXT:    sub a4, a6, t6
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
+; RV32I-NEXT:    sub a5, a5, a6
+; RV32I-NEXT:    sub a1, a3, a1
+; RV32I-NEXT:    sub a4, a4, a2
+; RV32I-NEXT:    sub a2, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t6
+; RV32I-NEXT:    sub a3, a4, t5
+; RV32I-NEXT:    sub a2, a2, a6
+; RV32I-NEXT:    sub a4, a5, t6
 ; RV32I-NEXT:  .LBB11_13:
-; RV32I-NEXT:    sw a3, 0(a0)
-; RV32I-NEXT:    sw a2, 4(a0)
+; RV32I-NEXT:    sw a1, 0(a0)
+; RV32I-NEXT:    sw a3, 4(a0)
 ; RV32I-NEXT:    sw a4, 8(a0)
-; RV32I-NEXT:    sw a1, 12(a0)
+; RV32I-NEXT:    sw a2, 12(a0)
 ; RV32I-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
@@ -632,73 +632,73 @@ define i128 @abd_ext_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB:       # %bb.0:
 ; RV32ZBB-NEXT:    lw a3, 0(a1)
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
-; RV32ZBB-NEXT:    lw a7, 8(a2)
+; RV32ZBB-NEXT:    lw a5, 8(a1)
+; RV32ZBB-NEXT:    lw a7, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
+; RV32ZBB-NEXT:    lw a6, 8(a2)
 ; RV32ZBB-NEXT:    lw t1, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 0(a2)
-; RV32ZBB-NEXT:    lw a1, 4(a2)
-; RV32ZBB-NEXT:    sltu a2, a7, a6
-; RV32ZBB-NEXT:    mv t4, a2
-; RV32ZBB-NEXT:    beq t0, t1, .LBB11_2
+; RV32ZBB-NEXT:    lw a2, 4(a2)
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq a7, t1, .LBB11_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    slt t4, t1, t0
+; RV32ZBB-NEXT:    slt t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB11_2:
-; RV32ZBB-NEXT:    sltu t2, a5, a3
-; RV32ZBB-NEXT:    sltu t5, a1, a4
+; RV32ZBB-NEXT:    sltu t2, a1, a3
+; RV32ZBB-NEXT:    sltu t5, a2, a4
 ; RV32ZBB-NEXT:    mv t3, t2
-; RV32ZBB-NEXT:    beq a4, a1, .LBB11_4
+; RV32ZBB-NEXT:    beq a4, a2, .LBB11_4
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    mv t3, t5
 ; RV32ZBB-NEXT:  .LBB11_4:
 ; RV32ZBB-NEXT:    addi sp, sp, -16
 ; RV32ZBB-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32ZBB-NEXT:    xor t6, t0, t1
-; RV32ZBB-NEXT:    xor s0, a6, a7
+; RV32ZBB-NEXT:    xor t6, a7, t1
+; RV32ZBB-NEXT:    xor s0, a5, a6
 ; RV32ZBB-NEXT:    or t6, s0, t6
 ; RV32ZBB-NEXT:    beqz t6, .LBB11_6
 ; RV32ZBB-NEXT:  # %bb.5:
 ; RV32ZBB-NEXT:    mv t3, t4
 ; RV32ZBB-NEXT:  .LBB11_6:
 ; RV32ZBB-NEXT:    mv t4, t2
-; RV32ZBB-NEXT:    beq a1, a4, .LBB11_8
+; RV32ZBB-NEXT:    beq a2, a4, .LBB11_8
 ; RV32ZBB-NEXT:  # %bb.7:
 ; RV32ZBB-NEXT:    mv t4, t5
 ; RV32ZBB-NEXT:  .LBB11_8:
-; RV32ZBB-NEXT:    sltu t5, a3, a5
+; RV32ZBB-NEXT:    sltu t5, a3, a1
 ; RV32ZBB-NEXT:    mv t6, t5
-; RV32ZBB-NEXT:    beq a4, a1, .LBB11_10
+; RV32ZBB-NEXT:    beq a4, a2, .LBB11_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t6, a4, a1
+; RV32ZBB-NEXT:    sltu t6, a4, a2
 ; RV32ZBB-NEXT:  .LBB11_10:
 ; RV32ZBB-NEXT:    bnez t3, .LBB11_12
 ; RV32ZBB-NEXT:  # %bb.11:
-; RV32ZBB-NEXT:    sub t0, t1, t0
-; RV32ZBB-NEXT:    sub a6, a7, a6
-; RV32ZBB-NEXT:    sub a3, a5, a3
-; RV32ZBB-NEXT:    sub a1, a1, a4
-; RV32ZBB-NEXT:    sub a4, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t4
-; RV32ZBB-NEXT:    sub a2, a1, t2
-; RV32ZBB-NEXT:    sub a1, a4, a5
-; RV32ZBB-NEXT:    sub a4, a6, t4
+; RV32ZBB-NEXT:    sub a7, t1, a7
+; RV32ZBB-NEXT:    sub a5, a6, a5
+; RV32ZBB-NEXT:    sub a1, a1, a3
+; RV32ZBB-NEXT:    sub a2, a2, a4
+; RV32ZBB-NEXT:    sub a4, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t4
+; RV32ZBB-NEXT:    sub a3, a2, t2
+; RV32ZBB-NEXT:    sub a2, a4, a6
+; RV32ZBB-NEXT:    sub a4, a5, t4
 ; RV32ZBB-NEXT:    j .LBB11_13
 ; RV32ZBB-NEXT:  .LBB11_12:
-; RV32ZBB-NEXT:    sltu a2, a6, a7
-; RV32ZBB-NEXT:    sub t0, t0, t1
-; RV32ZBB-NEXT:    sub a6, a6, a7
-; RV32ZBB-NEXT:    sub a3, a3, a5
-; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a1, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t6
-; RV32ZBB-NEXT:    sub a2, a4, t5
-; RV32ZBB-NEXT:    sub a1, a1, a5
-; RV32ZBB-NEXT:    sub a4, a6, t6
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
+; RV32ZBB-NEXT:    sub a5, a5, a6
+; RV32ZBB-NEXT:    sub a1, a3, a1
+; RV32ZBB-NEXT:    sub a4, a4, a2
+; RV32ZBB-NEXT:    sub a2, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t6
+; RV32ZBB-NEXT:    sub a3, a4, t5
+; RV32ZBB-NEXT:    sub a2, a2, a6
+; RV32ZBB-NEXT:    sub a4, a5, t6
 ; RV32ZBB-NEXT:  .LBB11_13:
-; RV32ZBB-NEXT:    sw a3, 0(a0)
-; RV32ZBB-NEXT:    sw a2, 4(a0)
+; RV32ZBB-NEXT:    sw a1, 0(a0)
+; RV32ZBB-NEXT:    sw a3, 4(a0)
 ; RV32ZBB-NEXT:    sw a4, 8(a0)
-; RV32ZBB-NEXT:    sw a1, 12(a0)
+; RV32ZBB-NEXT:    sw a2, 12(a0)
 ; RV32ZBB-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32ZBB-NEXT:    addi sp, sp, 16
 ; RV32ZBB-NEXT:    ret
@@ -736,73 +736,73 @@ define i128 @abd_ext_i128_undef(i128 %a, i128 %b) nounwind {
 ; RV32I:       # %bb.0:
 ; RV32I-NEXT:    lw a3, 0(a1)
 ; RV32I-NEXT:    lw a4, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
-; RV32I-NEXT:    lw a7, 8(a2)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a7, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
+; RV32I-NEXT:    lw a6, 8(a2)
 ; RV32I-NEXT:    lw t1, 12(a2)
-; RV32I-NEXT:    lw a5, 0(a2)
-; RV32I-NEXT:    lw a1, 4(a2)
-; RV32I-NEXT:    sltu a2, a7, a6
-; RV32I-NEXT:    mv t4, a2
-; RV32I-NEXT:    beq t0, t1, .LBB12_2
+; RV32I-NEXT:    lw a2, 4(a2)
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq a7, t1, .LBB12_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    slt t4, t1, t0
+; RV32I-NEXT:    slt t4, t1, a7
 ; RV32I-NEXT:  .LBB12_2:
-; RV32I-NEXT:    sltu t2, a5, a3
-; RV32I-NEXT:    sltu t5, a1, a4
+; RV32I-NEXT:    sltu t2, a1, a3
+; RV32I-NEXT:    sltu t5, a2, a4
 ; RV32I-NEXT:    mv t3, t2
-; RV32I-NEXT:    beq a4, a1, .LBB12_4
+; RV32I-NEXT:    beq a4, a2, .LBB12_4
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    mv t3, t5
 ; RV32I-NEXT:  .LBB12_4:
 ; RV32I-NEXT:    addi sp, sp, -16
 ; RV32I-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    xor t6, t0, t1
-; RV32I-NEXT:    xor s0, a6, a7
+; RV32I-NEXT:    xor t6, a7, t1
+; RV32I-NEXT:    xor s0, a5, a6
 ; RV32I-NEXT:    or t6, s0, t6
 ; RV32I-NEXT:    beqz t6, .LBB12_6
 ; RV32I-NEXT:  # %bb.5:
 ; RV32I-NEXT:    mv t3, t4
 ; RV32I-NEXT:  .LBB12_6:
 ; RV32I-NEXT:    mv t4, t2
-; RV32I-NEXT:    beq a1, a4, .LBB12_8
+; RV32I-NEXT:    beq a2, a4, .LBB12_8
 ; RV32I-NEXT:  # %bb.7:
 ; RV32I-NEXT:    mv t4, t5
 ; RV32I-NEXT:  .LBB12_8:
-; RV32I-NEXT:    sltu t5, a3, a5
+; RV32I-NEXT:    sltu t5, a3, a1
 ; RV32I-NEXT:    mv t6, t5
-; RV32I-NEXT:    beq a4, a1, .LBB12_10
+; RV32I-NEXT:    beq a4, a2, .LBB12_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t6, a4, a1
+; RV32I-NEXT:    sltu t6, a4, a2
 ; RV32I-NEXT:  .LBB12_10:
 ; RV32I-NEXT:    bnez t3, .LBB12_12
 ; RV32I-NEXT:  # %bb.11:
-; RV32I-NEXT:    sub t0, t1, t0
-; RV32I-NEXT:    sub a6, a7, a6
-; RV32I-NEXT:    sub a3, a5, a3
-; RV32I-NEXT:    sub a1, a1, a4
-; RV32I-NEXT:    sub a4, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t4
-; RV32I-NEXT:    sub a2, a1, t2
-; RV32I-NEXT:    sub a1, a4, a5
-; RV32I-NEXT:    sub a4, a6, t4
+; RV32I-NEXT:    sub a7, t1, a7
+; RV32I-NEXT:    sub a5, a6, a5
+; RV32I-NEXT:    sub a1, a1, a3
+; RV32I-NEXT:    sub a2, a2, a4
+; RV32I-NEXT:    sub a4, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t4
+; RV32I-NEXT:    sub a3, a2, t2
+; RV32I-NEXT:    sub a2, a4, a6
+; RV32I-NEXT:    sub a4, a5, t4
 ; RV32I-NEXT:    j .LBB12_13
 ; RV32I-NEXT:  .LBB12_12:
-; RV32I-NEXT:    sltu a2, a6, a7
-; RV32I-NEXT:    sub t0, t0, t1
-; RV32I-NEXT:    sub a6, a6, a7
-; RV32I-NEXT:    sub a3, a3, a5
-; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a1, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t6
-; RV32I-NEXT:    sub a2, a4, t5
-; RV32I-NEXT:    sub a1, a1, a5
-; RV32I-NEXT:    sub a4, a6, t6
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
+; RV32I-NEXT:    sub a5, a5, a6
+; RV32I-NEXT:    sub a1, a3, a1
+; RV32I-NEXT:    sub a4, a4, a2
+; RV32I-NEXT:    sub a2, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t6
+; RV32I-NEXT:    sub a3, a4, t5
+; RV32I-NEXT:    sub a2, a2, a6
+; RV32I-NEXT:    sub a4, a5, t6
 ; RV32I-NEXT:  .LBB12_13:
-; RV32I-NEXT:    sw a3, 0(a0)
-; RV32I-NEXT:    sw a2, 4(a0)
+; RV32I-NEXT:    sw a1, 0(a0)
+; RV32I-NEXT:    sw a3, 4(a0)
 ; RV32I-NEXT:    sw a4, 8(a0)
-; RV32I-NEXT:    sw a1, 12(a0)
+; RV32I-NEXT:    sw a2, 12(a0)
 ; RV32I-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
@@ -832,73 +832,73 @@ define i128 @abd_ext_i128_undef(i128 %a, i128 %b) nounwind {
 ; RV32ZBB:       # %bb.0:
 ; RV32ZBB-NEXT:    lw a3, 0(a1)
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
-; RV32ZBB-NEXT:    lw a7, 8(a2)
+; RV32ZBB-NEXT:    lw a5, 8(a1)
+; RV32ZBB-NEXT:    lw a7, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
+; RV32ZBB-NEXT:    lw a6, 8(a2)
 ; RV32ZBB-NEXT:    lw t1, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 0(a2)
-; RV32ZBB-NEXT:    lw a1, 4(a2)
-; RV32ZBB-NEXT:    sltu a2, a7, a6
-; RV32ZBB-NEXT:    mv t4, a2
-; RV32ZBB-NEXT:    beq t0, t1, .LBB12_2
+; RV32ZBB-NEXT:    lw a2, 4(a2)
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq a7, t1, .LBB12_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    slt t4, t1, t0
+; RV32ZBB-NEXT:    slt t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB12_2:
-; RV32ZBB-NEXT:    sltu t2, a5, a3
-; RV32ZBB-NEXT:    sltu t5, a1, a4
+; RV32ZBB-NEXT:    sltu t2, a1, a3
+; RV32ZBB-NEXT:    sltu t5, a2, a4
 ; RV32ZBB-NEXT:    mv t3, t2
-; RV32ZBB-NEXT:    beq a4, a1, .LBB12_4
+; RV32ZBB-NEXT:    beq a4, a2, .LBB12_4
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    mv t3, t5
 ; RV32ZBB-NEXT:  .LBB12_4:
 ; RV32ZBB-NEXT:    addi sp, sp, -16
 ; RV32ZBB-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32ZBB-NEXT:    xor t6, t0, t1
-; RV32ZBB-NEXT:    xor s0, a6, a7
+; RV32ZBB-NEXT:    xor t6, a7, t1
+; RV32ZBB-NEXT:    xor s0, a5, a6
 ; RV32ZBB-NEXT:    or t6, s0, t6
 ; RV32ZBB-NEXT:    beqz t6, .LBB12_6
 ; RV32ZBB-NEXT:  # %bb.5:
 ; RV32ZBB-NEXT:    mv t3, t4
 ; RV32ZBB-NEXT:  .LBB12_6:
 ; RV32ZBB-NEXT:    mv t4, t2
-; RV32ZBB-NEXT:    beq a1, a4, .LBB12_8
+; RV32ZBB-NEXT:    beq a2, a4, .LBB12_8
 ; RV32ZBB-NEXT:  # %bb.7:
 ; RV32ZBB-NEXT:    mv t4, t5
 ; RV32ZBB-NEXT:  .LBB12_8:
-; RV32ZBB-NEXT:    sltu t5, a3, a5
+; RV32ZBB-NEXT:    sltu t5, a3, a1
 ; RV32ZBB-NEXT:    mv t6, t5
-; RV32ZBB-NEXT:    beq a4, a1, .LBB12_10
+; RV32ZBB-NEXT:    beq a4, a2, .LBB12_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t6, a4, a1
+; RV32ZBB-NEXT:    sltu t6, a4, a2
 ; RV32ZBB-NEXT:  .LBB12_10:
 ; RV32ZBB-NEXT:    bnez t3, .LBB12_12
 ; RV32ZBB-NEXT:  # %bb.11:
-; RV32ZBB-NEXT:    sub t0, t1, t0
-; RV32ZBB-NEXT:    sub a6, a7, a6
-; RV32ZBB-NEXT:    sub a3, a5, a3
-; RV32ZBB-NEXT:    sub a1, a1, a4
-; RV32ZBB-NEXT:    sub a4, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t4
-; RV32ZBB-NEXT:    sub a2, a1, t2
-; RV32ZBB-NEXT:    sub a1, a4, a5
-; RV32ZBB-NEXT:    sub a4, a6, t4
+; RV32ZBB-NEXT:    sub a7, t1, a7
+; RV32ZBB-NEXT:    sub a5, a6, a5
+; RV32ZBB-NEXT:    sub a1, a1, a3
+; RV32ZBB-NEXT:    sub a2, a2, a4
+; RV32ZBB-NEXT:    sub a4, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t4
+; RV32ZBB-NEXT:    sub a3, a2, t2
+; RV32ZBB-NEXT:    sub a2, a4, a6
+; RV32ZBB-NEXT:    sub a4, a5, t4
 ; RV32ZBB-NEXT:    j .LBB12_13
 ; RV32ZBB-NEXT:  .LBB12_12:
-; RV32ZBB-NEXT:    sltu a2, a6, a7
-; RV32ZBB-NEXT:    sub t0, t0, t1
-; RV32ZBB-NEXT:    sub a6, a6, a7
-; RV32ZBB-NEXT:    sub a3, a3, a5
-; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a1, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t6
-; RV32ZBB-NEXT:    sub a2, a4, t5
-; RV32ZBB-NEXT:    sub a1, a1, a5
-; RV32ZBB-NEXT:    sub a4, a6, t6
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
+; RV32ZBB-NEXT:    sub a5, a5, a6
+; RV32ZBB-NEXT:    sub a1, a3, a1
+; RV32ZBB-NEXT:    sub a4, a4, a2
+; RV32ZBB-NEXT:    sub a2, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t6
+; RV32ZBB-NEXT:    sub a3, a4, t5
+; RV32ZBB-NEXT:    sub a2, a2, a6
+; RV32ZBB-NEXT:    sub a4, a5, t6
 ; RV32ZBB-NEXT:  .LBB12_13:
-; RV32ZBB-NEXT:    sw a3, 0(a0)
-; RV32ZBB-NEXT:    sw a2, 4(a0)
+; RV32ZBB-NEXT:    sw a1, 0(a0)
+; RV32ZBB-NEXT:    sw a3, 4(a0)
 ; RV32ZBB-NEXT:    sw a4, 8(a0)
-; RV32ZBB-NEXT:    sw a1, 12(a0)
+; RV32ZBB-NEXT:    sw a2, 12(a0)
 ; RV32ZBB-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32ZBB-NEXT:    addi sp, sp, 16
 ; RV32ZBB-NEXT:    ret
@@ -1125,73 +1125,73 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 ; RV32I:       # %bb.0:
 ; RV32I-NEXT:    lw a3, 0(a1)
 ; RV32I-NEXT:    lw a4, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
-; RV32I-NEXT:    lw a7, 8(a2)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a7, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
+; RV32I-NEXT:    lw a6, 8(a2)
 ; RV32I-NEXT:    lw t1, 12(a2)
-; RV32I-NEXT:    lw a5, 0(a2)
-; RV32I-NEXT:    lw a1, 4(a2)
-; RV32I-NEXT:    sltu a2, a7, a6
-; RV32I-NEXT:    mv t4, a2
-; RV32I-NEXT:    beq t0, t1, .LBB17_2
+; RV32I-NEXT:    lw a2, 4(a2)
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq a7, t1, .LBB17_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    slt t4, t1, t0
+; RV32I-NEXT:    slt t4, t1, a7
 ; RV32I-NEXT:  .LBB17_2:
-; RV32I-NEXT:    sltu t2, a5, a3
-; RV32I-NEXT:    sltu t5, a1, a4
+; RV32I-NEXT:    sltu t2, a1, a3
+; RV32I-NEXT:    sltu t5, a2, a4
 ; RV32I-NEXT:    mv t3, t2
-; RV32I-NEXT:    beq a4, a1, .LBB17_4
+; RV32I-NEXT:    beq a4, a2, .LBB17_4
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    mv t3, t5
 ; RV32I-NEXT:  .LBB17_4:
 ; RV32I-NEXT:    addi sp, sp, -16
 ; RV32I-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    xor t6, t0, t1
-; RV32I-NEXT:    xor s0, a6, a7
+; RV32I-NEXT:    xor t6, a7, t1
+; RV32I-NEXT:    xor s0, a5, a6
 ; RV32I-NEXT:    or t6, s0, t6
 ; RV32I-NEXT:    beqz t6, .LBB17_6
 ; RV32I-NEXT:  # %bb.5:
 ; RV32I-NEXT:    mv t3, t4
 ; RV32I-NEXT:  .LBB17_6:
 ; RV32I-NEXT:    mv t4, t2
-; RV32I-NEXT:    beq a1, a4, .LBB17_8
+; RV32I-NEXT:    beq a2, a4, .LBB17_8
 ; RV32I-NEXT:  # %bb.7:
 ; RV32I-NEXT:    mv t4, t5
 ; RV32I-NEXT:  .LBB17_8:
-; RV32I-NEXT:    sltu t5, a3, a5
+; RV32I-NEXT:    sltu t5, a3, a1
 ; RV32I-NEXT:    mv t6, t5
-; RV32I-NEXT:    beq a4, a1, .LBB17_10
+; RV32I-NEXT:    beq a4, a2, .LBB17_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t6, a4, a1
+; RV32I-NEXT:    sltu t6, a4, a2
 ; RV32I-NEXT:  .LBB17_10:
 ; RV32I-NEXT:    bnez t3, .LBB17_12
 ; RV32I-NEXT:  # %bb.11:
-; RV32I-NEXT:    sub t0, t1, t0
-; RV32I-NEXT:    sub a6, a7, a6
-; RV32I-NEXT:    sub a3, a5, a3
-; RV32I-NEXT:    sub a1, a1, a4
-; RV32I-NEXT:    sub a4, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t4
-; RV32I-NEXT:    sub a2, a1, t2
-; RV32I-NEXT:    sub a1, a4, a5
-; RV32I-NEXT:    sub a4, a6, t4
+; RV32I-NEXT:    sub a7, t1, a7
+; RV32I-NEXT:    sub a5, a6, a5
+; RV32I-NEXT:    sub a1, a1, a3
+; RV32I-NEXT:    sub a2, a2, a4
+; RV32I-NEXT:    sub a4, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t4
+; RV32I-NEXT:    sub a3, a2, t2
+; RV32I-NEXT:    sub a2, a4, a6
+; RV32I-NEXT:    sub a4, a5, t4
 ; RV32I-NEXT:    j .LBB17_13
 ; RV32I-NEXT:  .LBB17_12:
-; RV32I-NEXT:    sltu a2, a6, a7
-; RV32I-NEXT:    sub t0, t0, t1
-; RV32I-NEXT:    sub a6, a6, a7
-; RV32I-NEXT:    sub a3, a3, a5
-; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a1, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t6
-; RV32I-NEXT:    sub a2, a4, t5
-; RV32I-NEXT:    sub a1, a1, a5
-; RV32I-NEXT:    sub a4, a6, t6
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
+; RV32I-NEXT:    sub a5, a5, a6
+; RV32I-NEXT:    sub a1, a3, a1
+; RV32I-NEXT:    sub a4, a4, a2
+; RV32I-NEXT:    sub a2, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t6
+; RV32I-NEXT:    sub a3, a4, t5
+; RV32I-NEXT:    sub a2, a2, a6
+; RV32I-NEXT:    sub a4, a5, t6
 ; RV32I-NEXT:  .LBB17_13:
-; RV32I-NEXT:    sw a3, 0(a0)
-; RV32I-NEXT:    sw a2, 4(a0)
+; RV32I-NEXT:    sw a1, 0(a0)
+; RV32I-NEXT:    sw a3, 4(a0)
 ; RV32I-NEXT:    sw a4, 8(a0)
-; RV32I-NEXT:    sw a1, 12(a0)
+; RV32I-NEXT:    sw a2, 12(a0)
 ; RV32I-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
@@ -1221,73 +1221,73 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB:       # %bb.0:
 ; RV32ZBB-NEXT:    lw a3, 0(a1)
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
-; RV32ZBB-NEXT:    lw a7, 8(a2)
+; RV32ZBB-NEXT:    lw a5, 8(a1)
+; RV32ZBB-NEXT:    lw a7, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
+; RV32ZBB-NEXT:    lw a6, 8(a2)
 ; RV32ZBB-NEXT:    lw t1, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 0(a2)
-; RV32ZBB-NEXT:    lw a1, 4(a2)
-; RV32ZBB-NEXT:    sltu a2, a7, a6
-; RV32ZBB-NEXT:    mv t4, a2
-; RV32ZBB-NEXT:    beq t0, t1, .LBB17_2
+; RV32ZBB-NEXT:    lw a2, 4(a2)
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq a7, t1, .LBB17_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    slt t4, t1, t0
+; RV32ZBB-NEXT:    slt t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB17_2:
-; RV32ZBB-NEXT:    sltu t2, a5, a3
-; RV32ZBB-NEXT:    sltu t5, a1, a4
+; RV32ZBB-NEXT:    sltu t2, a1, a3
+; RV32ZBB-NEXT:    sltu t5, a2, a4
 ; RV32ZBB-NEXT:    mv t3, t2
-; RV32ZBB-NEXT:    beq a4, a1, .LBB17_4
+; RV32ZBB-NEXT:    beq a4, a2, .LBB17_4
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    mv t3, t5
 ; RV32ZBB-NEXT:  .LBB17_4:
 ; RV32ZBB-NEXT:    addi sp, sp, -16
 ; RV32ZBB-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32ZBB-NEXT:    xor t6, t0, t1
-; RV32ZBB-NEXT:    xor s0, a6, a7
+; RV32ZBB-NEXT:    xor t6, a7, t1
+; RV32ZBB-NEXT:    xor s0, a5, a6
 ; RV32ZBB-NEXT:    or t6, s0, t6
 ; RV32ZBB-NEXT:    beqz t6, .LBB17_6
 ; RV32ZBB-NEXT:  # %bb.5:
 ; RV32ZBB-NEXT:    mv t3, t4
 ; RV32ZBB-NEXT:  .LBB17_6:
 ; RV32ZBB-NEXT:    mv t4, t2
-; RV32ZBB-NEXT:    beq a1, a4, .LBB17_8
+; RV32ZBB-NEXT:    beq a2, a4, .LBB17_8
 ; RV32ZBB-NEXT:  # %bb.7:
 ; RV32ZBB-NEXT:    mv t4, t5
 ; RV32ZBB-NEXT:  .LBB17_8:
-; RV32ZBB-NEXT:    sltu t5, a3, a5
+; RV32ZBB-NEXT:    sltu t5, a3, a1
 ; RV32ZBB-NEXT:    mv t6, t5
-; RV32ZBB-NEXT:    beq a4, a1, .LBB17_10
+; RV32ZBB-NEXT:    beq a4, a2, .LBB17_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t6, a4, a1
+; RV32ZBB-NEXT:    sltu t6, a4, a2
 ; RV32ZBB-NEXT:  .LBB17_10:
 ; RV32ZBB-NEXT:    bnez t3, .LBB17_12
 ; RV32ZBB-NEXT:  # %bb.11:
-; RV32ZBB-NEXT:    sub t0, t1, t0
-; RV32ZBB-NEXT:    sub a6, a7, a6
-; RV32ZBB-NEXT:    sub a3, a5, a3
-; RV32ZBB-NEXT:    sub a1, a1, a4
-; RV32ZBB-NEXT:    sub a4, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t4
-; RV32ZBB-NEXT:    sub a2, a1, t2
-; RV32ZBB-NEXT:    sub a1, a4, a5
-; RV32ZBB-NEXT:    sub a4, a6, t4
+; RV32ZBB-NEXT:    sub a7, t1, a7
+; RV32ZBB-NEXT:    sub a5, a6, a5
+; RV32ZBB-NEXT:    sub a1, a1, a3
+; RV32ZBB-NEXT:    sub a2, a2, a4
+; RV32ZBB-NEXT:    sub a4, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t4
+; RV32ZBB-NEXT:    sub a3, a2, t2
+; RV32ZBB-NEXT:    sub a2, a4, a6
+; RV32ZBB-NEXT:    sub a4, a5, t4
 ; RV32ZBB-NEXT:    j .LBB17_13
 ; RV32ZBB-NEXT:  .LBB17_12:
-; RV32ZBB-NEXT:    sltu a2, a6, a7
-; RV32ZBB-NEXT:    sub t0, t0, t1
-; RV32ZBB-NEXT:    sub a6, a6, a7
-; RV32ZBB-NEXT:    sub a3, a3, a5
-; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a1, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t6
-; RV32ZBB-NEXT:    sub a2, a4, t5
-; RV32ZBB-NEXT:    sub a1, a1, a5
-; RV32ZBB-NEXT:    sub a4, a6, t6
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
+; RV32ZBB-NEXT:    sub a5, a5, a6
+; RV32ZBB-NEXT:    sub a1, a3, a1
+; RV32ZBB-NEXT:    sub a4, a4, a2
+; RV32ZBB-NEXT:    sub a2, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t6
+; RV32ZBB-NEXT:    sub a3, a4, t5
+; RV32ZBB-NEXT:    sub a2, a2, a6
+; RV32ZBB-NEXT:    sub a4, a5, t6
 ; RV32ZBB-NEXT:  .LBB17_13:
-; RV32ZBB-NEXT:    sw a3, 0(a0)
-; RV32ZBB-NEXT:    sw a2, 4(a0)
+; RV32ZBB-NEXT:    sw a1, 0(a0)
+; RV32ZBB-NEXT:    sw a3, 4(a0)
 ; RV32ZBB-NEXT:    sw a4, 8(a0)
-; RV32ZBB-NEXT:    sw a1, 12(a0)
+; RV32ZBB-NEXT:    sw a2, 12(a0)
 ; RV32ZBB-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32ZBB-NEXT:    addi sp, sp, 16
 ; RV32ZBB-NEXT:    ret
@@ -1516,73 +1516,73 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I:       # %bb.0:
 ; RV32I-NEXT:    lw a3, 0(a1)
 ; RV32I-NEXT:    lw a4, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
-; RV32I-NEXT:    lw a7, 8(a2)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a7, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
+; RV32I-NEXT:    lw a6, 8(a2)
 ; RV32I-NEXT:    lw t1, 12(a2)
-; RV32I-NEXT:    lw a5, 0(a2)
-; RV32I-NEXT:    lw a1, 4(a2)
-; RV32I-NEXT:    sltu a2, a7, a6
-; RV32I-NEXT:    mv t4, a2
-; RV32I-NEXT:    beq t0, t1, .LBB22_2
+; RV32I-NEXT:    lw a2, 4(a2)
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq a7, t1, .LBB22_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    slt t4, t1, t0
+; RV32I-NEXT:    slt t4, t1, a7
 ; RV32I-NEXT:  .LBB22_2:
-; RV32I-NEXT:    sltu t2, a5, a3
-; RV32I-NEXT:    sltu t5, a1, a4
+; RV32I-NEXT:    sltu t2, a1, a3
+; RV32I-NEXT:    sltu t5, a2, a4
 ; RV32I-NEXT:    mv t3, t2
-; RV32I-NEXT:    beq a4, a1, .LBB22_4
+; RV32I-NEXT:    beq a4, a2, .LBB22_4
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    mv t3, t5
 ; RV32I-NEXT:  .LBB22_4:
 ; RV32I-NEXT:    addi sp, sp, -16
 ; RV32I-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    xor t6, t0, t1
-; RV32I-NEXT:    xor s0, a6, a7
+; RV32I-NEXT:    xor t6, a7, t1
+; RV32I-NEXT:    xor s0, a5, a6
 ; RV32I-NEXT:    or t6, s0, t6
 ; RV32I-NEXT:    beqz t6, .LBB22_6
 ; RV32I-NEXT:  # %bb.5:
 ; RV32I-NEXT:    mv t3, t4
 ; RV32I-NEXT:  .LBB22_6:
 ; RV32I-NEXT:    mv t4, t2
-; RV32I-NEXT:    beq a1, a4, .LBB22_8
+; RV32I-NEXT:    beq a2, a4, .LBB22_8
 ; RV32I-NEXT:  # %bb.7:
 ; RV32I-NEXT:    mv t4, t5
 ; RV32I-NEXT:  .LBB22_8:
-; RV32I-NEXT:    sltu t5, a3, a5
+; RV32I-NEXT:    sltu t5, a3, a1
 ; RV32I-NEXT:    mv t6, t5
-; RV32I-NEXT:    beq a4, a1, .LBB22_10
+; RV32I-NEXT:    beq a4, a2, .LBB22_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t6, a4, a1
+; RV32I-NEXT:    sltu t6, a4, a2
 ; RV32I-NEXT:  .LBB22_10:
 ; RV32I-NEXT:    bnez t3, .LBB22_12
 ; RV32I-NEXT:  # %bb.11:
-; RV32I-NEXT:    sub t0, t1, t0
-; RV32I-NEXT:    sub a6, a7, a6
-; RV32I-NEXT:    sub a3, a5, a3
-; RV32I-NEXT:    sub a1, a1, a4
-; RV32I-NEXT:    sub a4, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t4
-; RV32I-NEXT:    sub a2, a1, t2
-; RV32I-NEXT:    sub a1, a4, a5
-; RV32I-NEXT:    sub a4, a6, t4
+; RV32I-NEXT:    sub a7, t1, a7
+; RV32I-NEXT:    sub a5, a6, a5
+; RV32I-NEXT:    sub a1, a1, a3
+; RV32I-NEXT:    sub a2, a2, a4
+; RV32I-NEXT:    sub a4, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t4
+; RV32I-NEXT:    sub a3, a2, t2
+; RV32I-NEXT:    sub a2, a4, a6
+; RV32I-NEXT:    sub a4, a5, t4
 ; RV32I-NEXT:    j .LBB22_13
 ; RV32I-NEXT:  .LBB22_12:
-; RV32I-NEXT:    sltu a2, a6, a7
-; RV32I-NEXT:    sub t0, t0, t1
-; RV32I-NEXT:    sub a6, a6, a7
-; RV32I-NEXT:    sub a3, a3, a5
-; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a1, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t6
-; RV32I-NEXT:    sub a2, a4, t5
-; RV32I-NEXT:    sub a1, a1, a5
-; RV32I-NEXT:    sub a4, a6, t6
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
+; RV32I-NEXT:    sub a5, a5, a6
+; RV32I-NEXT:    sub a1, a3, a1
+; RV32I-NEXT:    sub a4, a4, a2
+; RV32I-NEXT:    sub a2, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t6
+; RV32I-NEXT:    sub a3, a4, t5
+; RV32I-NEXT:    sub a2, a2, a6
+; RV32I-NEXT:    sub a4, a5, t6
 ; RV32I-NEXT:  .LBB22_13:
-; RV32I-NEXT:    sw a3, 0(a0)
-; RV32I-NEXT:    sw a2, 4(a0)
+; RV32I-NEXT:    sw a1, 0(a0)
+; RV32I-NEXT:    sw a3, 4(a0)
 ; RV32I-NEXT:    sw a4, 8(a0)
-; RV32I-NEXT:    sw a1, 12(a0)
+; RV32I-NEXT:    sw a2, 12(a0)
 ; RV32I-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
@@ -1612,73 +1612,73 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB:       # %bb.0:
 ; RV32ZBB-NEXT:    lw a3, 0(a1)
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
-; RV32ZBB-NEXT:    lw a7, 8(a2)
+; RV32ZBB-NEXT:    lw a5, 8(a1)
+; RV32ZBB-NEXT:    lw a7, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
+; RV32ZBB-NEXT:    lw a6, 8(a2)
 ; RV32ZBB-NEXT:    lw t1, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 0(a2)
-; RV32ZBB-NEXT:    lw a1, 4(a2)
-; RV32ZBB-NEXT:    sltu a2, a7, a6
-; RV32ZBB-NEXT:    mv t4, a2
-; RV32ZBB-NEXT:    beq t0, t1, .LBB22_2
+; RV32ZBB-NEXT:    lw a2, 4(a2)
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq a7, t1, .LBB22_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    slt t4, t1, t0
+; RV32ZBB-NEXT:    slt t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB22_2:
-; RV32ZBB-NEXT:    sltu t2, a5, a3
-; RV32ZBB-NEXT:    sltu t5, a1, a4
+; RV32ZBB-NEXT:    sltu t2, a1, a3
+; RV32ZBB-NEXT:    sltu t5, a2, a4
 ; RV32ZBB-NEXT:    mv t3, t2
-; RV32ZBB-NEXT:    beq a4, a1, .LBB22_4
+; RV32ZBB-NEXT:    beq a4, a2, .LBB22_4
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    mv t3, t5
 ; RV32ZBB-NEXT:  .LBB22_4:
 ; RV32ZBB-NEXT:    addi sp, sp, -16
 ; RV32ZBB-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32ZBB-NEXT:    xor t6, t0, t1
-; RV32ZBB-NEXT:    xor s0, a6, a7
+; RV32ZBB-NEXT:    xor t6, a7, t1
+; RV32ZBB-NEXT:    xor s0, a5, a6
 ; RV32ZBB-NEXT:    or t6, s0, t6
 ; RV32ZBB-NEXT:    beqz t6, .LBB22_6
 ; RV32ZBB-NEXT:  # %bb.5:
 ; RV32ZBB-NEXT:    mv t3, t4
 ; RV32ZBB-NEXT:  .LBB22_6:
 ; RV32ZBB-NEXT:    mv t4, t2
-; RV32ZBB-NEXT:    beq a1, a4, .LBB22_8
+; RV32ZBB-NEXT:    beq a2, a4, .LBB22_8
 ; RV32ZBB-NEXT:  # %bb.7:
 ; RV32ZBB-NEXT:    mv t4, t5
 ; RV32ZBB-NEXT:  .LBB22_8:
-; RV32ZBB-NEXT:    sltu t5, a3, a5
+; RV32ZBB-NEXT:    sltu t5, a3, a1
 ; RV32ZBB-NEXT:    mv t6, t5
-; RV32ZBB-NEXT:    beq a4, a1, .LBB22_10
+; RV32ZBB-NEXT:    beq a4, a2, .LBB22_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t6, a4, a1
+; RV32ZBB-NEXT:    sltu t6, a4, a2
 ; RV32ZBB-NEXT:  .LBB22_10:
 ; RV32ZBB-NEXT:    bnez t3, .LBB22_12
 ; RV32ZBB-NEXT:  # %bb.11:
-; RV32ZBB-NEXT:    sub t0, t1, t0
-; RV32ZBB-NEXT:    sub a6, a7, a6
-; RV32ZBB-NEXT:    sub a3, a5, a3
-; RV32ZBB-NEXT:    sub a1, a1, a4
-; RV32ZBB-NEXT:    sub a4, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t4
-; RV32ZBB-NEXT:    sub a2, a1, t2
-; RV32ZBB-NEXT:    sub a1, a4, a5
-; RV32ZBB-NEXT:    sub a4, a6, t4
+; RV32ZBB-NEXT:    sub a7, t1, a7
+; RV32ZBB-NEXT:    sub a5, a6, a5
+; RV32ZBB-NEXT:    sub a1, a1, a3
+; RV32ZBB-NEXT:    sub a2, a2, a4
+; RV32ZBB-NEXT:    sub a4, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t4
+; RV32ZBB-NEXT:    sub a3, a2, t2
+; RV32ZBB-NEXT:    sub a2, a4, a6
+; RV32ZBB-NEXT:    sub a4, a5, t4
 ; RV32ZBB-NEXT:    j .LBB22_13
 ; RV32ZBB-NEXT:  .LBB22_12:
-; RV32ZBB-NEXT:    sltu a2, a6, a7
-; RV32ZBB-NEXT:    sub t0, t0, t1
-; RV32ZBB-NEXT:    sub a6, a6, a7
-; RV32ZBB-NEXT:    sub a3, a3, a5
-; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a1, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t6
-; RV32ZBB-NEXT:    sub a2, a4, t5
-; RV32ZBB-NEXT:    sub a1, a1, a5
-; RV32ZBB-NEXT:    sub a4, a6, t6
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
+; RV32ZBB-NEXT:    sub a5, a5, a6
+; RV32ZBB-NEXT:    sub a1, a3, a1
+; RV32ZBB-NEXT:    sub a4, a4, a2
+; RV32ZBB-NEXT:    sub a2, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t6
+; RV32ZBB-NEXT:    sub a3, a4, t5
+; RV32ZBB-NEXT:    sub a2, a2, a6
+; RV32ZBB-NEXT:    sub a4, a5, t6
 ; RV32ZBB-NEXT:  .LBB22_13:
-; RV32ZBB-NEXT:    sw a3, 0(a0)
-; RV32ZBB-NEXT:    sw a2, 4(a0)
+; RV32ZBB-NEXT:    sw a1, 0(a0)
+; RV32ZBB-NEXT:    sw a3, 4(a0)
 ; RV32ZBB-NEXT:    sw a4, 8(a0)
-; RV32ZBB-NEXT:    sw a1, 12(a0)
+; RV32ZBB-NEXT:    sw a2, 12(a0)
 ; RV32ZBB-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32ZBB-NEXT:    addi sp, sp, 16
 ; RV32ZBB-NEXT:    ret
@@ -2539,73 +2539,73 @@ define i128 @abd_select_i128(i128 %a, i128 %b) nounwind {
 ; RV32I:       # %bb.0:
 ; RV32I-NEXT:    lw a3, 0(a1)
 ; RV32I-NEXT:    lw a4, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
-; RV32I-NEXT:    lw a7, 8(a2)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a7, 12(a1)
+; RV32I-NEXT:    lw a1, 0(a2)
+; RV32I-NEXT:    lw a6, 8(a2)
 ; RV32I-NEXT:    lw t1, 12(a2)
-; RV32I-NEXT:    lw a5, 0(a2)
-; RV32I-NEXT:    lw a1, 4(a2)
-; RV32I-NEXT:    sltu a2, a7, a6
-; RV32I-NEXT:    mv t4, a2
-; RV32I-NEXT:    beq t0, t1, .LBB38_2
+; RV32I-NEXT:    lw a2, 4(a2)
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq a7, t1, .LBB38_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    slt t4, t1, t0
+; RV32I-NEXT:    slt t4, t1, a7
 ; RV32I-NEXT:  .LBB38_2:
-; RV32I-NEXT:    sltu t2, a5, a3
-; RV32I-NEXT:    sltu t5, a1, a4
+; RV32I-NEXT:    sltu t2, a1, a3
+; RV32I-NEXT:    sltu t5, a2, a4
 ; RV32I-NEXT:    mv t3, t2
-; RV32I-NEXT:    beq a4, a1, .LBB38_4
+; RV32I-NEXT:    beq a4, a2, .LBB38_4
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    mv t3, t5
 ; RV32I-NEXT:  .LBB38_4:
 ; RV32I-NEXT:    addi sp, sp, -16
 ; RV32I-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    xor t6, t0, t1
-; RV32I-NEXT:    xor s0, a6, a7
+; RV32I-NEXT:    xor t6, a7, t1
+; RV32I-NEXT:    xor s0, a5, a6
 ; RV32I-NEXT:    or t6, s0, t6
 ; RV32I-NEXT:    beqz t6, .LBB38_6
 ; RV32I-NEXT:  # %bb.5:
 ; RV32I-NEXT:    mv t3, t4
 ; RV32I-NEXT:  .LBB38_6:
 ; RV32I-NEXT:    mv t4, t2
-; RV32I-NEXT:    beq a1, a4, .LBB38_8
+; RV32I-NEXT:    beq a2, a4, .LBB38_8
 ; RV32I-NEXT:  # %bb.7:
 ; RV32I-NEXT:    mv t4, t5
 ; RV32I-NEXT:  .LBB38_8:
-; RV32I-NEXT:    sltu t5, a3, a5
+; RV32I-NEXT:    sltu t5, a3, a1
 ; RV32I-NEXT:    mv t6, t5
-; RV32I-NEXT:    beq a4, a1, .LBB38_10
+; RV32I-NEXT:    beq a4, a2, .LBB38_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t6, a4, a1
+; RV32I-NEXT:    sltu t6, a4, a2
 ; RV32I-NEXT:  .LBB38_10:
 ; RV32I-NEXT:    bnez t3, .LBB38_12
 ; RV32I-NEXT:  # %bb.11:
-; RV32I-NEXT:    sub t0, t1, t0
-; RV32I-NEXT:    sub a6, a7, a6
-; RV32I-NEXT:    sub a3, a5, a3
-; RV32I-NEXT:    sub a1, a1, a4
-; RV32I-NEXT:    sub a4, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t4
-; RV32I-NEXT:    sub a2, a1, t2
-; RV32I-NEXT:    sub a1, a4, a5
-; RV32I-NEXT:    sub a4, a6, t4
+; RV32I-NEXT:    sub a7, t1, a7
+; RV32I-NEXT:    sub a5, a6, a5
+; RV32I-NEXT:    sub a1, a1, a3
+; RV32I-NEXT:    sub a2, a2, a4
+; RV32I-NEXT:    sub a4, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t4
+; RV32I-NEXT:    sub a3, a2, t2
+; RV32I-NEXT:    sub a2, a4, a6
+; RV32I-NEXT:    sub a4, a5, t4
 ; RV32I-NEXT:    j .LBB38_13
 ; RV32I-NEXT:  .LBB38_12:
-; RV32I-NEXT:    sltu a2, a6, a7
-; RV32I-NEXT:    sub t0, t0, t1
-; RV32I-NEXT:    sub a6, a6, a7
-; RV32I-NEXT:    sub a3, a3, a5
-; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a1, t0, a2
-; RV32I-NEXT:    sltu a5, a6, t6
-; RV32I-NEXT:    sub a2, a4, t5
-; RV32I-NEXT:    sub a1, a1, a5
-; RV32I-NEXT:    sub a4, a6, t6
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
+; RV32I-NEXT:    sub a5, a5, a6
+; RV32I-NEXT:    sub a1, a3, a1
+; RV32I-NEXT:    sub a4, a4, a2
+; RV32I-NEXT:    sub a2, a7, t0
+; RV32I-NEXT:    sltu a6, a5, t6
+; RV32I-NEXT:    sub a3, a4, t5
+; RV32I-NEXT:    sub a2, a2, a6
+; RV32I-NEXT:    sub a4, a5, t6
 ; RV32I-NEXT:  .LBB38_13:
-; RV32I-NEXT:    sw a3, 0(a0)
-; RV32I-NEXT:    sw a2, 4(a0)
+; RV32I-NEXT:    sw a1, 0(a0)
+; RV32I-NEXT:    sw a3, 4(a0)
 ; RV32I-NEXT:    sw a4, 8(a0)
-; RV32I-NEXT:    sw a1, 12(a0)
+; RV32I-NEXT:    sw a2, 12(a0)
 ; RV32I-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
@@ -2635,73 +2635,73 @@ define i128 @abd_select_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB:       # %bb.0:
 ; RV32ZBB-NEXT:    lw a3, 0(a1)
 ; RV32ZBB-NEXT:    lw a4, 4(a1)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
-; RV32ZBB-NEXT:    lw a7, 8(a2)
+; RV32ZBB-NEXT:    lw a5, 8(a1)
+; RV32ZBB-NEXT:    lw a7, 12(a1)
+; RV32ZBB-NEXT:    lw a1, 0(a2)
+; RV32ZBB-NEXT:    lw a6, 8(a2)
 ; RV32ZBB-NEXT:    lw t1, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 0(a2)
-; RV32ZBB-NEXT:    lw a1, 4(a2)
-; RV32ZBB-NEXT:    sltu a2, a7, a6
-; RV32ZBB-NEXT:    mv t4, a2
-; RV32ZBB-NEXT:    beq t0, t1, .LBB38_2
+; RV32ZBB-NEXT:    lw a2, 4(a2)
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq a7, t1, .LBB38_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    slt t4, t1, t0
+; RV32ZBB-NEXT:    slt t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB38_2:
-; RV32ZBB-NEXT:    sltu t2, a5, a3
-; RV32ZBB-NEXT:    sltu t5, a1, a4
+; RV32ZBB-NEXT:    sltu t2, a1, a3
+; RV32ZBB-NEXT:    sltu t5, a2, a4
 ; RV32ZBB-NEXT:    mv t3, t2
-; RV32ZBB-NEXT:    beq a4, a1, .LBB38_4
+; RV32ZBB-NEXT:    beq a4, a2, .LBB38_4
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    mv t3, t5
 ; RV32ZBB-NEXT:  .LBB38_4:
 ; RV32ZBB-NEXT:    addi sp, sp, -16
 ; RV32ZBB-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
-; RV32ZBB-NEXT:    xor t6, t0, t1
-; RV32ZBB-NEXT:    xor s0, a6, a7
+; RV32ZBB-NEXT:    xor t6, a7, t1
+; RV32ZBB-NEXT:    xor s0, a5, a6
 ; RV32ZBB-NEXT:    or t6, s0, t6
 ; RV32ZBB-NEXT:    beqz t6, .LBB38_6
 ; RV32ZBB-NEXT:  # %bb.5:
 ; RV32ZBB-NEXT:    mv t3, t4
 ; RV32ZBB-NEXT:  .LBB38_6:
 ; RV32ZBB-NEXT:    mv t4, t2
-; RV32ZBB-NEXT:    beq a1, a4, .LBB38_8
+; RV32ZBB-NEXT:    beq a2, a4, .LBB38_8
 ; RV32ZBB-NEXT:  # %bb.7:
 ; RV32ZBB-NEXT:    mv t4, t5
 ; RV32ZBB-NEXT:  .LBB38_8:
-; RV32ZBB-NEXT:    sltu t5, a3, a5
+; RV32ZBB-NEXT:    sltu t5, a3, a1
 ; RV32ZBB-NEXT:    mv t6, t5
-; RV32ZBB-NEXT:    beq a4, a1, .LBB38_10
+; RV32ZBB-NEXT:    beq a4, a2, .LBB38_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t6, a4, a1
+; RV32ZBB-NEXT:    sltu t6, a4, a2
 ; RV32ZBB-NEXT:  .LBB38_10:
 ; RV32ZBB-NEXT:    bnez t3, .LBB38_12
 ; RV32ZBB-NEXT:  # %bb.11:
-; RV32ZBB-NEXT:    sub t0, t1, t0
-; RV32ZBB-NEXT:    sub a6, a7, a6
-; RV32ZBB-NEXT:    sub a3, a5, a3
-; RV32ZBB-NEXT:    sub a1, a1, a4
-; RV32ZBB-NEXT:    sub a4, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t4
-; RV32ZBB-NEXT:    sub a2, a1, t2
-; RV32ZBB-NEXT:    sub a1, a4, a5
-; RV32ZBB-NEXT:    sub a4, a6, t4
+; RV32ZBB-NEXT:    sub a7, t1, a7
+; RV32ZBB-NEXT:    sub a5, a6, a5
+; RV32ZBB-NEXT:    sub a1, a1, a3
+; RV32ZBB-NEXT:    sub a2, a2, a4
+; RV32ZBB-NEXT:    sub a4, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t4
+; RV32ZBB-NEXT:    sub a3, a2, t2
+; RV32ZBB-NEXT:    sub a2, a4, a6
+; RV32ZBB-NEXT:    sub a4, a5, t4
 ; RV32ZBB-NEXT:    j .LBB38_13
 ; RV32ZBB-NEXT:  .LBB38_12:
-; RV32ZBB-NEXT:    sltu a2, a6, a7
-; RV32ZBB-NEXT:    sub t0, t0, t1
-; RV32ZBB-NEXT:    sub a6, a6, a7
-; RV32ZBB-NEXT:    sub a3, a3, a5
-; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a1, t0, a2
-; RV32ZBB-NEXT:    sltu a5, a6, t6
-; RV32ZBB-NEXT:    sub a2, a4, t5
-; RV32ZBB-NEXT:    sub a1, a1, a5
-; RV32ZBB-NEXT:    sub a4, a6, t6
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
+; RV32ZBB-NEXT:    sub a5, a5, a6
+; RV32ZBB-NEXT:    sub a1, a3, a1
+; RV32ZBB-NEXT:    sub a4, a4, a2
+; RV32ZBB-NEXT:    sub a2, a7, t0
+; RV32ZBB-NEXT:    sltu a6, a5, t6
+; RV32ZBB-NEXT:    sub a3, a4, t5
+; RV32ZBB-NEXT:    sub a2, a2, a6
+; RV32ZBB-NEXT:    sub a4, a5, t6
 ; RV32ZBB-NEXT:  .LBB38_13:
-; RV32ZBB-NEXT:    sw a3, 0(a0)
-; RV32ZBB-NEXT:    sw a2, 4(a0)
+; RV32ZBB-NEXT:    sw a1, 0(a0)
+; RV32ZBB-NEXT:    sw a3, 4(a0)
 ; RV32ZBB-NEXT:    sw a4, 8(a0)
-; RV32ZBB-NEXT:    sw a1, 12(a0)
+; RV32ZBB-NEXT:    sw a2, 12(a0)
 ; RV32ZBB-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
 ; RV32ZBB-NEXT:    addi sp, sp, 16
 ; RV32ZBB-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/abdu-neg.ll b/llvm/test/CodeGen/RISCV/abdu-neg.ll
index 9fa142ee2aa1e..08a5b95ad69ed 100644
--- a/llvm/test/CodeGen/RISCV/abdu-neg.ll
+++ b/llvm/test/CodeGen/RISCV/abdu-neg.ll
@@ -1338,8 +1338,8 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    lw a6, 4(a2)
 ; RV32I-NEXT:    lw a7, 8(a2)
 ; RV32I-NEXT:    lw t0, 12(a2)
-; RV32I-NEXT:    lw a5, 12(a1)
 ; RV32I-NEXT:    lw a3, 4(a1)
+; RV32I-NEXT:    lw a5, 12(a1)
 ; RV32I-NEXT:    lw a4, 8(a1)
 ; RV32I-NEXT:    beq a5, t0, .LBB17_2
 ; RV32I-NEXT:  # %bb.1:
@@ -1465,8 +1465,8 @@ define i128 @abd_minmax_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    lw a6, 4(a2)
 ; RV32ZBB-NEXT:    lw a7, 8(a2)
 ; RV32ZBB-NEXT:    lw t0, 12(a2)
-; RV32ZBB-NEXT:    lw a5, 12(a1)
 ; RV32ZBB-NEXT:    lw a3, 4(a1)
+; RV32ZBB-NEXT:    lw a5, 12(a1)
 ; RV32ZBB-NEXT:    lw a4, 8(a1)
 ; RV32ZBB-NEXT:    beq a5, t0, .LBB17_2
 ; RV32ZBB-NEXT:  # %bb.1:
@@ -1801,15 +1801,15 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    lw a4, 4(a2)
 ; RV32I-NEXT:    lw a5, 8(a2)
 ; RV32I-NEXT:    lw a7, 12(a2)
-; RV32I-NEXT:    lw a6, 8(a1)
-; RV32I-NEXT:    lw t0, 12(a1)
 ; RV32I-NEXT:    lw a2, 0(a1)
+; RV32I-NEXT:    lw a6, 8(a1)
+; RV32I-NEXT:    lw t1, 12(a1)
 ; RV32I-NEXT:    lw a1, 4(a1)
-; RV32I-NEXT:    sltu t1, a6, a5
-; RV32I-NEXT:    mv t4, t1
-; RV32I-NEXT:    beq t0, a7, .LBB22_2
+; RV32I-NEXT:    sltu t0, a6, a5
+; RV32I-NEXT:    mv t4, t0
+; RV32I-NEXT:    beq t1, a7, .LBB22_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    sltu t4, t0, a7
+; RV32I-NEXT:    sltu t4, t1, a7
 ; RV32I-NEXT:  .LBB22_2:
 ; RV32I-NEXT:    sltu t2, a2, a3
 ; RV32I-NEXT:    mv t3, t2
@@ -1817,7 +1817,7 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:  # %bb.3:
 ; RV32I-NEXT:    sltu t3, a1, a4
 ; RV32I-NEXT:  .LBB22_4:
-; RV32I-NEXT:    xor t5, t0, a7
+; RV32I-NEXT:    xor t5, t1, a7
 ; RV32I-NEXT:    xor t6, a6, a5
 ; RV32I-NEXT:    or t5, t6, t5
 ; RV32I-NEXT:    mv t6, t3
@@ -1833,11 +1833,11 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:  .LBB22_8:
 ; RV32I-NEXT:    bnez t6, .LBB22_10
 ; RV32I-NEXT:  # %bb.9:
-; RV32I-NEXT:    sltu t1, a5, a6
-; RV32I-NEXT:    sub a7, a7, t0
+; RV32I-NEXT:    sltu t0, a5, a6
+; RV32I-NEXT:    sub a7, a7, t1
 ; RV32I-NEXT:    sub a5, a5, a6
 ; RV32I-NEXT:    sub a4, a4, a1
-; RV32I-NEXT:    sub a6, a7, t1
+; RV32I-NEXT:    sub a6, a7, t0
 ; RV32I-NEXT:    sltu a7, a5, t5
 ; RV32I-NEXT:    sub a1, a5, t5
 ; RV32I-NEXT:    sub a5, a4, t4
@@ -1845,10 +1845,10 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32I-NEXT:    sub a2, a3, a2
 ; RV32I-NEXT:    j .LBB22_11
 ; RV32I-NEXT:  .LBB22_10:
-; RV32I-NEXT:    sub a7, t0, a7
+; RV32I-NEXT:    sub a7, t1, a7
 ; RV32I-NEXT:    sub a5, a6, a5
 ; RV32I-NEXT:    sub a4, a1, a4
-; RV32I-NEXT:    sub a6, a7, t1
+; RV32I-NEXT:    sub a6, a7, t0
 ; RV32I-NEXT:    sltu a7, a5, t3
 ; RV32I-NEXT:    sub a1, a5, t3
 ; RV32I-NEXT:    sub a5, a4, t2
@@ -1888,15 +1888,15 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    lw a4, 4(a2)
 ; RV32ZBB-NEXT:    lw a5, 8(a2)
 ; RV32ZBB-NEXT:    lw a7, 12(a2)
-; RV32ZBB-NEXT:    lw a6, 8(a1)
-; RV32ZBB-NEXT:    lw t0, 12(a1)
 ; RV32ZBB-NEXT:    lw a2, 0(a1)
+; RV32ZBB-NEXT:    lw a6, 8(a1)
+; RV32ZBB-NEXT:    lw t1, 12(a1)
 ; RV32ZBB-NEXT:    lw a1, 4(a1)
-; RV32ZBB-NEXT:    sltu t1, a6, a5
-; RV32ZBB-NEXT:    mv t4, t1
-; RV32ZBB-NEXT:    beq t0, a7, .LBB22_2
+; RV32ZBB-NEXT:    sltu t0, a6, a5
+; RV32ZBB-NEXT:    mv t4, t0
+; RV32ZBB-NEXT:    beq t1, a7, .LBB22_2
 ; RV32ZBB-NEXT:  # %bb.1:
-; RV32ZBB-NEXT:    sltu t4, t0, a7
+; RV32ZBB-NEXT:    sltu t4, t1, a7
 ; RV32ZBB-NEXT:  .LBB22_2:
 ; RV32ZBB-NEXT:    sltu t2, a2, a3
 ; RV32ZBB-NEXT:    mv t3, t2
@@ -1904,7 +1904,7 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:  # %bb.3:
 ; RV32ZBB-NEXT:    sltu t3, a1, a4
 ; RV32ZBB-NEXT:  .LBB22_4:
-; RV32ZBB-NEXT:    xor t5, t0, a7
+; RV32ZBB-NEXT:    xor t5, t1, a7
 ; RV32ZBB-NEXT:    xor t6, a6, a5
 ; RV32ZBB-NEXT:    or t5, t6, t5
 ; RV32ZBB-NEXT:    mv t6, t3
@@ -1920,11 +1920,11 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:  .LBB22_8:
 ; RV32ZBB-NEXT:    bnez t6, .LBB22_10
 ; RV32ZBB-NEXT:  # %bb.9:
-; RV32ZBB-NEXT:    sltu t1, a5, a6
-; RV32ZBB-NEXT:    sub a7, a7, t0
+; RV32ZBB-NEXT:    sltu t0, a5, a6
+; RV32ZBB-NEXT:    sub a7, a7, t1
 ; RV32ZBB-NEXT:    sub a5, a5, a6
 ; RV32ZBB-NEXT:    sub a4, a4, a1
-; RV32ZBB-NEXT:    sub a6, a7, t1
+; RV32ZBB-NEXT:    sub a6, a7, t0
 ; RV32ZBB-NEXT:    sltu a7, a5, t5
 ; RV32ZBB-NEXT:    sub a1, a5, t5
 ; RV32ZBB-NEXT:    sub a5, a4, t4
@@ -1932,10 +1932,10 @@ define i128 @abd_cmp_i128(i128 %a, i128 %b) nounwind {
 ; RV32ZBB-NEXT:    sub a2, a3, a2
 ; RV32ZBB-NEXT:    j .LBB22_11
 ; RV32ZBB-NEXT:  .LBB22_10:
-; RV32ZBB-NEXT:    sub a7, t0, a7
+; RV32ZBB-NEXT:    sub a7, t1, a7
 ; RV32ZBB-NEXT:    sub a5, a6, a5
 ; RV32ZBB-NEXT:    sub a4, a1, a4
-; RV32ZBB-NEXT:    sub a6, a7, t1
+; RV32ZBB-NEXT:    sub a6, a7, t0
 ; RV32ZBB-NEXT:    sltu a7, a5, t3
 ; RV32ZBB-NEXT:    sub a1, a5, t3
 ; RV32ZBB-NEXT:    sub a5, a4, t2
diff --git a/llvm/test/CodeGen/RISCV/add-before-shl.ll b/llvm/test/CodeGen/RISCV/add-before-shl.ll
index b6ff3c9060af5..35a39b89a2cb7 100644
--- a/llvm/test/CodeGen/RISCV/add-before-shl.ll
+++ b/llvm/test/CodeGen/RISCV/add-before-shl.ll
@@ -200,26 +200,26 @@ define i128 @add_wide_operand(i128 %a) nounwind {
 ;
 ; RV32C-LABEL: add_wide_operand:
 ; RV32C:       # %bb.0:
+; RV32C-NEXT:    c.lw a2, 0(a1)
 ; RV32C-NEXT:    c.lw a4, 12(a1)
-; RV32C-NEXT:    c.lw a3, 0(a1)
-; RV32C-NEXT:    c.lw a2, 4(a1)
+; RV32C-NEXT:    c.lw a3, 4(a1)
 ; RV32C-NEXT:    c.lw a1, 8(a1)
 ; RV32C-NEXT:    c.lui a5, 16
 ; RV32C-NEXT:    add a6, a4, a5
-; RV32C-NEXT:    srli a5, a3, 29
-; RV32C-NEXT:    slli a4, a2, 3
+; RV32C-NEXT:    srli a5, a2, 29
+; RV32C-NEXT:    slli a4, a3, 3
 ; RV32C-NEXT:    c.or a4, a5
 ; RV32C-NEXT:    srli a5, a1, 29
-; RV32C-NEXT:    c.srli a2, 29
+; RV32C-NEXT:    c.srli a3, 29
 ; RV32C-NEXT:    c.slli a1, 3
-; RV32C-NEXT:    c.slli a3, 3
+; RV32C-NEXT:    c.slli a2, 3
 ; RV32C-NEXT:    c.slli a6, 3
-; RV32C-NEXT:    c.or a1, a2
-; RV32C-NEXT:    or a2, a6, a5
-; RV32C-NEXT:    c.sw a3, 0(a0)
+; RV32C-NEXT:    c.or a1, a3
+; RV32C-NEXT:    or a3, a6, a5
+; RV32C-NEXT:    c.sw a2, 0(a0)
 ; RV32C-NEXT:    c.sw a4, 4(a0)
 ; RV32C-NEXT:    c.sw a1, 8(a0)
-; RV32C-NEXT:    c.sw a2, 12(a0)
+; RV32C-NEXT:    c.sw a3, 12(a0)
 ; RV32C-NEXT:    c.jr ra
 ;
 ; RV64C-LABEL: add_wide_operand:
diff --git a/llvm/test/CodeGen/RISCV/fold-mem-offset.ll b/llvm/test/CodeGen/RISCV/fold-mem-offset.ll
index 7d8b8d29aa3c9..662948653486b 100644
--- a/llvm/test/CodeGen/RISCV/fold-mem-offset.ll
+++ b/llvm/test/CodeGen/RISCV/fold-mem-offset.ll
@@ -213,12 +213,12 @@ define i64 @test_sh3add_uw(ptr %p, i32 signext %x, i32 signext %y) {
 ; RV32I-NEXT:    slli a2, a2, 3
 ; RV32I-NEXT:    add a1, a0, a1
 ; RV32I-NEXT:    add a0, a0, a2
-; RV32I-NEXT:    lw a2, 404(a0)
-; RV32I-NEXT:    lw a3, 400(a1)
+; RV32I-NEXT:    lw a2, 400(a1)
 ; RV32I-NEXT:    lw a1, 404(a1)
+; RV32I-NEXT:    lw a3, 404(a0)
 ; RV32I-NEXT:    lw a4, 400(a0)
-; RV32I-NEXT:    add a1, a2, a1
-; RV32I-NEXT:    add a0, a4, a3
+; RV32I-NEXT:    add a1, a3, a1
+; RV32I-NEXT:    add a0, a4, a2
 ; RV32I-NEXT:    sltu a2, a0, a4
 ; RV32I-NEXT:    add a1, a1, a2
 ; RV32I-NEXT:    ret
@@ -240,12 +240,12 @@ define i64 @test_sh3add_uw(ptr %p, i32 signext %x, i32 signext %y) {
 ; RV32ZBA:       # %bb.0: # %entry
 ; RV32ZBA-NEXT:    sh3add a1, a1, a0
 ; RV32ZBA-NEXT:    sh3add a0, a2, a0
-; RV32ZBA-NEXT:    lw a2, 404(a0)
-; RV32ZBA-NEXT:    lw a3, 400(a1)
+; RV32ZBA-NEXT:    lw a2, 400(a1)
 ; RV32ZBA-NEXT:    lw a1, 404(a1)
+; RV32ZBA-NEXT:    lw a3, 404(a0)
 ; RV32ZBA-NEXT:    lw a4, 400(a0)
-; RV32ZBA-NEXT:    add a1, a2, a1
-; RV32ZBA-NEXT:    add a0, a4, a3
+; RV32ZBA-NEXT:    add a1, a3, a1
+; RV32ZBA-NEXT:    add a0, a4, a2
 ; RV32ZBA-NEXT:    sltu a2, a0, a4
 ; RV32ZBA-NEXT:    add a1, a1, a2
 ; RV32ZBA-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/legalize-fneg.ll b/llvm/test/CodeGen/RISCV/legalize-fneg.ll
index f60b77b92c09e..9e66eb7a2ae6c 100644
--- a/llvm/test/CodeGen/RISCV/legalize-fneg.ll
+++ b/llvm/test/CodeGen/RISCV/legalize-fneg.ll
@@ -56,16 +56,16 @@ entry:
 define void @test3(ptr %a, ptr %b) nounwind {
 ; RV32-LABEL: test3:
 ; RV32:       # %bb.0: # %entry
-; RV32-NEXT:    lw a2, 12(a1)
-; RV32-NEXT:    lw a3, 0(a1)
+; RV32-NEXT:    lw a2, 0(a1)
+; RV32-NEXT:    lw a3, 12(a1)
 ; RV32-NEXT:    lw a4, 4(a1)
 ; RV32-NEXT:    lw a1, 8(a1)
 ; RV32-NEXT:    lui a5, 524288
-; RV32-NEXT:    xor a2, a2, a5
-; RV32-NEXT:    sw a3, 0(a0)
+; RV32-NEXT:    xor a3, a3, a5
+; RV32-NEXT:    sw a2, 0(a0)
 ; RV32-NEXT:    sw a4, 4(a0)
 ; RV32-NEXT:    sw a1, 8(a0)
-; RV32-NEXT:    sw a2, 12(a0)
+; RV32-NEXT:    sw a3, 12(a0)
 ; RV32-NEXT:    ret
 ;
 ; RV64-LABEL: test3:
diff --git a/llvm/test/CodeGen/RISCV/memcmp-optsize.ll b/llvm/test/CodeGen/RISCV/memcmp-optsize.ll
index f9086ba9d6354..38cd51c074594 100644
--- a/llvm/test/CodeGen/RISCV/memcmp-optsize.ll
+++ b/llvm/test/CodeGen/RISCV/memcmp-optsize.ll
@@ -4418,16 +4418,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind optsize {
 ; CHECK-ALIGNED-RV32-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV32-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV32-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV32-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV32-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV32-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV32-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV32-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV32-NEXT:    lbu a0, 3(a0)
-; CHECK-ALIGNED-RV32-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV32-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV32-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV32-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV32-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV32-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV32-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV32-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV32-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV32-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV32-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV32-NEXT:    xor a0, a0, a1
@@ -4444,16 +4444,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind optsize {
 ; CHECK-ALIGNED-RV64-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV64-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV64-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV64-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV64-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV64-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV64-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV64-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV64-NEXT:    lb a0, 3(a0)
-; CHECK-ALIGNED-RV64-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV64-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV64-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV64-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV64-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV64-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV64-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV64-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV64-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV64-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV64-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV64-NEXT:    xor a0, a0, a1
@@ -4470,16 +4470,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind optsize {
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a0, 3(a0)
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    xor a0, a0, a1
@@ -4496,16 +4496,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind optsize {
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    lb a0, 3(a0)
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    xor a0, a0, a1
@@ -4566,16 +4566,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind optsize {
 ; CHECK-ALIGNED-RV32-V-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV32-V-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV32-V-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV32-V-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV32-V-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-V-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-V-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV32-V-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV32-V-NEXT:    lbu a0, 3(a0)
-; CHECK-ALIGNED-RV32-V-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV32-V-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV32-V-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV32-V-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV32-V-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV32-V-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV32-V-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV32-V-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV32-V-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV32-V-NEXT:    xor a0, a0, a1
@@ -4592,16 +4592,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind optsize {
 ; CHECK-ALIGNED-RV64-V-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV64-V-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV64-V-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV64-V-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV64-V-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-V-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-V-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV64-V-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV64-V-NEXT:    lb a0, 3(a0)
-; CHECK-ALIGNED-RV64-V-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV64-V-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV64-V-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV64-V-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV64-V-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV64-V-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV64-V-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV64-V-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV64-V-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV64-V-NEXT:    xor a0, a0, a1
diff --git a/llvm/test/CodeGen/RISCV/memcmp.ll b/llvm/test/CodeGen/RISCV/memcmp.ll
index f0290298e362a..df9d781a4536d 100644
--- a/llvm/test/CodeGen/RISCV/memcmp.ll
+++ b/llvm/test/CodeGen/RISCV/memcmp.ll
@@ -5988,16 +5988,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind {
 ; CHECK-ALIGNED-RV32-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV32-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV32-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV32-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV32-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV32-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV32-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV32-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV32-NEXT:    lbu a0, 3(a0)
-; CHECK-ALIGNED-RV32-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV32-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV32-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV32-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV32-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV32-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV32-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV32-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV32-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV32-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV32-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV32-NEXT:    xor a0, a0, a1
@@ -6014,16 +6014,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind {
 ; CHECK-ALIGNED-RV64-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV64-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV64-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV64-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV64-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV64-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV64-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV64-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV64-NEXT:    lb a0, 3(a0)
-; CHECK-ALIGNED-RV64-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV64-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV64-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV64-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV64-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV64-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV64-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV64-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV64-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV64-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV64-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV64-NEXT:    xor a0, a0, a1
@@ -6040,16 +6040,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind {
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    lbu a0, 3(a0)
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV32-ZBB-NEXT:    xor a0, a0, a1
@@ -6066,16 +6066,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind {
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    lb a0, 3(a0)
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV64-ZBB-NEXT:    xor a0, a0, a1
@@ -6136,16 +6136,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind {
 ; CHECK-ALIGNED-RV32-V-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV32-V-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV32-V-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV32-V-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV32-V-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-V-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-V-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV32-V-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV32-V-NEXT:    lbu a0, 3(a0)
-; CHECK-ALIGNED-RV32-V-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV32-V-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV32-V-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV32-V-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV32-V-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV32-V-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV32-V-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV32-V-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV32-V-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV32-V-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV32-V-NEXT:    xor a0, a0, a1
@@ -6162,16 +6162,16 @@ define i1 @memcmp_eq_zero(ptr %s1, ptr %s2) nounwind {
 ; CHECK-ALIGNED-RV64-V-NEXT:    slli a3, a3, 16
 ; CHECK-ALIGNED-RV64-V-NEXT:    slli a4, a4, 24
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a1, a2, a1
-; CHECK-ALIGNED-RV64-V-NEXT:    lbu a2, 0(a0)
-; CHECK-ALIGNED-RV64-V-NEXT:    lbu a5, 1(a0)
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a3, a4, a3
-; CHECK-ALIGNED-RV64-V-NEXT:    lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-V-NEXT:    lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-V-NEXT:    lbu a4, 0(a0)
+; CHECK-ALIGNED-RV64-V-NEXT:    lbu a5, 2(a0)
 ; CHECK-ALIGNED-RV64-V-NEXT:    lb a0, 3(a0)
-; CHECK-ALIGNED-RV64-V-NEXT:    slli a5, a5, 8
-; CHECK-ALIGNED-RV64-V-NEXT:    or a2, a5, a2
-; CHECK-ALIGNED-RV64-V-NEXT:    slli a4, a4, 16
+; CHECK-ALIGNED-RV64-V-NEXT:    slli a2, a2, 8
+; CHECK-ALIGNED-RV64-V-NEXT:    or a2, a2, a4
+; CHECK-ALIGNED-RV64-V-NEXT:    slli a5, a5, 16
 ; CHECK-ALIGNED-RV64-V-NEXT:    slli a0, a0, 24
-; CHECK-ALIGNED-RV64-V-NEXT:    or a0, a0, a4
+; CHECK-ALIGNED-RV64-V-NEXT:    or a0, a0, a5
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a1, a3, a1
 ; CHECK-ALIGNED-RV64-V-NEXT:    or a0, a0, a2
 ; CHECK-ALIGNED-RV64-V-NEXT:    xor a0, a0, a1
diff --git a/llvm/test/CodeGen/RISCV/rv32zbb.ll b/llvm/test/CodeGen/RISCV/rv32zbb.ll
index 98c86da41afa1..4a842f2208f1d 100644
--- a/llvm/test/CodeGen/RISCV/rv32zbb.ll
+++ b/llvm/test/CodeGen/RISCV/rv32zbb.ll
@@ -1580,8 +1580,8 @@ define i128 @sub_if_uge_i128(i128 %x, i128 %y) {
 ; CHECK-NEXT:    lw a7, 4(a2)
 ; CHECK-NEXT:    lw a6, 8(a2)
 ; CHECK-NEXT:    lw t0, 12(a2)
-; CHECK-NEXT:    lw a4, 12(a1)
 ; CHECK-NEXT:    lw a3, 4(a1)
+; CHECK-NEXT:    lw a4, 12(a1)
 ; CHECK-NEXT:    lw a5, 8(a1)
 ; CHECK-NEXT:    beq a4, t0, .LBB53_2
 ; CHECK-NEXT:  # %bb.1:
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll
index e13f4f4b50b0f..651894a1bb661 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll
@@ -26,26 +26,26 @@ define void @add_v4i32(ptr %x, ptr %y) {
 define void @add_v2i64(ptr %x, ptr %y) {
 ; RV32-LABEL: add_v2i64:
 ; RV32:       # %bb.0:
-; RV32-NEXT:    lw a2, 0(a1)
-; RV32-NEXT:    lw a3, 4(a1)
-; RV32-NEXT:    lw a4, 0(a0)
-; RV32-NEXT:    lw a5, 4(a0)
-; RV32-NEXT:    lw a6, 8(a0)
-; RV32-NEXT:    lw a7, 12(a0)
+; RV32-NEXT:    lw a2, 0(a0)
+; RV32-NEXT:    lw a3, 4(a0)
+; RV32-NEXT:    lw a4, 8(a0)
+; RV32-NEXT:    lw a5, 12(a0)
+; RV32-NEXT:    lw a6, 4(a1)
+; RV32-NEXT:    lw a7, 0(a1)
 ; RV32-NEXT:    lw t0, 12(a1)
 ; RV32-NEXT:    lw a1, 8(a1)
-; RV32-NEXT:    add a3, a5, a3
-; RV32-NEXT:    add a2, a4, a2
-; RV32-NEXT:    add a7, a7, t0
-; RV32-NEXT:    add a1, a6, a1
-; RV32-NEXT:    sltu a4, a2, a4
-; RV32-NEXT:    sltu a5, a1, a6
-; RV32-NEXT:    add a3, a3, a4
-; RV32-NEXT:    add a5, a7, a5
-; RV32-NEXT:    sw a2, 0(a0)
-; RV32-NEXT:    sw a3, 4(a0)
+; RV32-NEXT:    add a3, a3, a6
+; RV32-NEXT:    add a7, a2, a7
+; RV32-NEXT:    add a5, a5, t0
+; RV32-NEXT:    add a1, a4, a1
+; RV32-NEXT:    sltu a2, a7, a2
+; RV32-NEXT:    sltu a4, a1, a4
+; RV32-NEXT:    add a2, a3, a2
+; RV32-NEXT:    add a4, a5, a4
+; RV32-NEXT:    sw a7, 0(a0)
+; RV32-NEXT:    sw a2, 4(a0)
 ; RV32-NEXT:    sw a1, 8(a0)
-; RV32-NEXT:    sw a5, 12(a0)
+; RV32-NEXT:    sw a4, 12(a0)
 ; RV32-NEXT:    ret
 ;
 ; RV64-LABEL: add_v2i64:
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
index aa55bd7af59c5..d9d1ab25f2d5c 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
@@ -1410,24 +1410,24 @@ define <16 x i8> @buildvec_v16i8_loads_contigous(ptr %p) {
 ; RV32VB-NEXT:    slli t1, t1, 24
 ; RV32VB-NEXT:    or a7, t0, a7
 ; RV32VB-NEXT:    or a4, a4, a5
-; RV32VB-NEXT:    lbu a5, 12(a0)
-; RV32VB-NEXT:    lbu t0, 13(a0)
-; RV32VB-NEXT:    or a6, t1, a6
+; RV32VB-NEXT:    or a5, t1, a6
+; RV32VB-NEXT:    lbu a6, 13(a0)
+; RV32VB-NEXT:    lbu t0, 12(a0)
 ; RV32VB-NEXT:    lbu t1, 14(a0)
 ; RV32VB-NEXT:    lbu a0, 15(a0)
-; RV32VB-NEXT:    slli t0, t0, 8
-; RV32VB-NEXT:    or a5, a5, t0
+; RV32VB-NEXT:    slli a6, a6, 8
+; RV32VB-NEXT:    or a6, t0, a6
 ; RV32VB-NEXT:    slli t1, t1, 16
 ; RV32VB-NEXT:    slli a0, a0, 24
 ; RV32VB-NEXT:    or a0, a0, t1
 ; RV32VB-NEXT:    or a1, a1, a3
 ; RV32VB-NEXT:    or a2, a2, a7
-; RV32VB-NEXT:    or a3, a4, a6
-; RV32VB-NEXT:    or a0, a5, a0
+; RV32VB-NEXT:    or a4, a4, a5
+; RV32VB-NEXT:    or a0, a6, a0
 ; RV32VB-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
 ; RV32VB-NEXT:    vmv.v.x v8, a1
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a2
-; RV32VB-NEXT:    vslide1down.vx v8, v8, a3
+; RV32VB-NEXT:    vslide1down.vx v8, v8, a4
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a0
 ; RV32VB-NEXT:    ret
 ;
@@ -1770,38 +1770,38 @@ define <16 x i8> @buildvec_v16i8_loads_gather(ptr %p) {
 ; RV32VB-NEXT:    slli a2, a2, 8
 ; RV32VB-NEXT:    slli a3, a3, 16
 ; RV32VB-NEXT:    slli a4, a4, 24
+; RV32VB-NEXT:    slli a7, a7, 8
 ; RV32VB-NEXT:    or a1, a1, a2
 ; RV32VB-NEXT:    or a3, a4, a3
-; RV32VB-NEXT:    lbu a2, 93(a0)
-; RV32VB-NEXT:    lbu a4, 105(a0)
-; RV32VB-NEXT:    lbu t2, 124(a0)
-; RV32VB-NEXT:    lbu t3, 144(a0)
-; RV32VB-NEXT:    slli a7, a7, 8
+; RV32VB-NEXT:    or a2, a6, a7
+; RV32VB-NEXT:    lbu a4, 93(a0)
+; RV32VB-NEXT:    lbu a6, 105(a0)
+; RV32VB-NEXT:    lbu a7, 124(a0)
+; RV32VB-NEXT:    lbu t2, 144(a0)
 ; RV32VB-NEXT:    slli a5, a5, 16
 ; RV32VB-NEXT:    slli t0, t0, 24
-; RV32VB-NEXT:    slli a2, a2, 8
-; RV32VB-NEXT:    or a6, a6, a7
+; RV32VB-NEXT:    slli a4, a4, 8
 ; RV32VB-NEXT:    or a5, t0, a5
-; RV32VB-NEXT:    lbu a7, 154(a0)
+; RV32VB-NEXT:    or a4, t1, a4
 ; RV32VB-NEXT:    lbu t0, 161(a0)
-; RV32VB-NEXT:    or a2, t1, a2
+; RV32VB-NEXT:    lbu t1, 154(a0)
 ; RV32VB-NEXT:    lbu a0, 163(a0)
-; RV32VB-NEXT:    slli a4, a4, 16
+; RV32VB-NEXT:    slli a6, a6, 16
 ; RV32VB-NEXT:    slli t0, t0, 24
-; RV32VB-NEXT:    or a4, t0, a4
+; RV32VB-NEXT:    or a6, t0, a6
 ; RV32VB-NEXT:    slli a0, a0, 8
-; RV32VB-NEXT:    or a0, t2, a0
-; RV32VB-NEXT:    slli t3, t3, 16
-; RV32VB-NEXT:    slli a7, a7, 24
-; RV32VB-NEXT:    or a7, a7, t3
+; RV32VB-NEXT:    or a0, a7, a0
+; RV32VB-NEXT:    slli t2, t2, 16
+; RV32VB-NEXT:    slli t1, t1, 24
+; RV32VB-NEXT:    or a7, t1, t2
 ; RV32VB-NEXT:    or a1, a1, a3
-; RV32VB-NEXT:    or a3, a6, a5
-; RV32VB-NEXT:    or a2, a2, a4
+; RV32VB-NEXT:    or a2, a2, a5
+; RV32VB-NEXT:    or a3, a4, a6
 ; RV32VB-NEXT:    or a0, a0, a7
 ; RV32VB-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
 ; RV32VB-NEXT:    vmv.v.x v8, a1
-; RV32VB-NEXT:    vslide1down.vx v8, v8, a3
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a2
+; RV32VB-NEXT:    vslide1down.vx v8, v8, a3
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a0
 ; RV32VB-NEXT:    ret
 ;
@@ -1893,52 +1893,52 @@ define <16 x i8> @buildvec_v16i8_loads_gather(ptr %p) {
 ;
 ; RVA22U64-LABEL: buildvec_v16i8_loads_gather:
 ; RVA22U64:       # %bb.0:
-; RVA22U64-NEXT:    lbu a1, 0(a0)
+; RVA22U64-NEXT:    lbu a7, 0(a0)
 ; RVA22U64-NEXT:    lbu a2, 1(a0)
 ; RVA22U64-NEXT:    lbu a3, 22(a0)
 ; RVA22U64-NEXT:    lbu a4, 31(a0)
 ; RVA22U64-NEXT:    lbu a6, 623(a0)
-; RVA22U64-NEXT:    lbu t0, 44(a0)
-; RVA22U64-NEXT:    lbu a7, 55(a0)
-; RVA22U64-NEXT:    lbu a5, 75(a0)
+; RVA22U64-NEXT:    lbu a5, 44(a0)
+; RVA22U64-NEXT:    lbu a1, 55(a0)
+; RVA22U64-NEXT:    lbu t0, 75(a0)
 ; RVA22U64-NEXT:    lbu t1, 82(a0)
 ; RVA22U64-NEXT:    slli a2, a2, 8
 ; RVA22U64-NEXT:    slli a3, a3, 16
 ; RVA22U64-NEXT:    slli a4, a4, 24
-; RVA22U64-NEXT:    or t2, a1, a2
+; RVA22U64-NEXT:    slli a5, a5, 32
+; RVA22U64-NEXT:    slli a1, a1, 40
+; RVA22U64-NEXT:    or a7, a7, a2
 ; RVA22U64-NEXT:    or t3, a4, a3
-; RVA22U64-NEXT:    lbu a2, 93(a0)
+; RVA22U64-NEXT:    or t2, a1, a5
+; RVA22U64-NEXT:    lbu a4, 93(a0)
 ; RVA22U64-NEXT:    lbu t4, 105(a0)
-; RVA22U64-NEXT:    lbu t6, 124(a0)
+; RVA22U64-NEXT:    lbu a2, 124(a0)
 ; RVA22U64-NEXT:    lbu t5, 144(a0)
-; RVA22U64-NEXT:    slli t0, t0, 32
-; RVA22U64-NEXT:    slli a7, a7, 40
 ; RVA22U64-NEXT:    slli a6, a6, 48
-; RVA22U64-NEXT:    slli a5, a5, 56
-; RVA22U64-NEXT:    slli a2, a2, 8
-; RVA22U64-NEXT:    or a7, a7, t0
-; RVA22U64-NEXT:    or a5, a5, a6
-; RVA22U64-NEXT:    lbu a3, 154(a0)
-; RVA22U64-NEXT:    lbu a1, 161(a0)
-; RVA22U64-NEXT:    or a2, t1, a2
+; RVA22U64-NEXT:    slli t0, t0, 56
+; RVA22U64-NEXT:    slli a4, a4, 8
+; RVA22U64-NEXT:    or a3, t0, a6
+; RVA22U64-NEXT:    or a4, t1, a4
+; RVA22U64-NEXT:    lbu a5, 161(a0)
+; RVA22U64-NEXT:    lbu a1, 154(a0)
 ; RVA22U64-NEXT:    lbu a0, 163(a0)
 ; RVA22U64-NEXT:    slli t4, t4, 16
-; RVA22U64-NEXT:    slli a1, a1, 24
-; RVA22U64-NEXT:    or a1, a1, t4
-; RVA22U64-NEXT:    slli t6, t6, 32
+; RVA22U64-NEXT:    slli a5, a5, 24
+; RVA22U64-NEXT:    or a5, a5, t4
+; RVA22U64-NEXT:    slli a2, a2, 32
 ; RVA22U64-NEXT:    slli a0, a0, 40
-; RVA22U64-NEXT:    or a0, a0, t6
+; RVA22U64-NEXT:    or a0, a0, a2
 ; RVA22U64-NEXT:    slli t5, t5, 48
-; RVA22U64-NEXT:    slli a3, a3, 56
-; RVA22U64-NEXT:    or a3, a3, t5
-; RVA22U64-NEXT:    or a4, t2, t3
-; RVA22U64-NEXT:    or a5, a5, a7
-; RVA22U64-NEXT:    or a1, a1, a2
-; RVA22U64-NEXT:    or a0, a0, a3
+; RVA22U64-NEXT:    slli a1, a1, 56
+; RVA22U64-NEXT:    or a1, a1, t5
+; RVA22U64-NEXT:    or a2, a7, t3
+; RVA22U64-NEXT:    or a3, a3, t2
 ; RVA22U64-NEXT:    or a4, a4, a5
 ; RVA22U64-NEXT:    or a0, a0, a1
+; RVA22U64-NEXT:    or a2, a2, a3
+; RVA22U64-NEXT:    or a0, a0, a4
 ; RVA22U64-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
-; RVA22U64-NEXT:    vmv.v.x v8, a4
+; RVA22U64-NEXT:    vmv.v.x v8, a2
 ; RVA22U64-NEXT:    vslide1down.vx v8, v8, a0
 ; RVA22U64-NEXT:    ret
 ;
@@ -2116,14 +2116,14 @@ define <16 x i8> @buildvec_v16i8_undef_low_half(ptr %p) {
 ; RV32VB-NEXT:    lbu a3, 105(a0)
 ; RV32VB-NEXT:    lbu a4, 124(a0)
 ; RV32VB-NEXT:    slli a1, a1, 8
+; RV32VB-NEXT:    or a1, a2, a1
+; RV32VB-NEXT:    lbu a2, 161(a0)
 ; RV32VB-NEXT:    lbu a5, 144(a0)
 ; RV32VB-NEXT:    lbu a6, 154(a0)
-; RV32VB-NEXT:    lbu a7, 161(a0)
-; RV32VB-NEXT:    or a1, a2, a1
 ; RV32VB-NEXT:    lbu a0, 163(a0)
 ; RV32VB-NEXT:    slli a3, a3, 16
-; RV32VB-NEXT:    slli a7, a7, 24
-; RV32VB-NEXT:    or a2, a7, a3
+; RV32VB-NEXT:    slli a2, a2, 24
+; RV32VB-NEXT:    or a2, a2, a3
 ; RV32VB-NEXT:    slli a0, a0, 8
 ; RV32VB-NEXT:    or a0, a4, a0
 ; RV32VB-NEXT:    slli a5, a5, 16
@@ -2187,27 +2187,27 @@ define <16 x i8> @buildvec_v16i8_undef_low_half(ptr %p) {
 ; RVA22U64-LABEL: buildvec_v16i8_undef_low_half:
 ; RVA22U64:       # %bb.0:
 ; RVA22U64-NEXT:    lbu a1, 93(a0)
-; RVA22U64-NEXT:    lbu a6, 82(a0)
-; RVA22U64-NEXT:    lbu a7, 105(a0)
+; RVA22U64-NEXT:    lbu a2, 82(a0)
+; RVA22U64-NEXT:    lbu a3, 105(a0)
 ; RVA22U64-NEXT:    lbu a4, 124(a0)
 ; RVA22U64-NEXT:    slli a1, a1, 8
+; RVA22U64-NEXT:    or a6, a2, a1
+; RVA22U64-NEXT:    lbu a2, 161(a0)
 ; RVA22U64-NEXT:    lbu a5, 144(a0)
-; RVA22U64-NEXT:    lbu a2, 154(a0)
-; RVA22U64-NEXT:    lbu a3, 161(a0)
-; RVA22U64-NEXT:    or a1, a6, a1
+; RVA22U64-NEXT:    lbu a1, 154(a0)
 ; RVA22U64-NEXT:    lbu a0, 163(a0)
-; RVA22U64-NEXT:    slli a7, a7, 16
-; RVA22U64-NEXT:    slli a3, a3, 24
-; RVA22U64-NEXT:    or a3, a3, a7
+; RVA22U64-NEXT:    slli a3, a3, 16
+; RVA22U64-NEXT:    slli a2, a2, 24
+; RVA22U64-NEXT:    or a2, a2, a3
 ; RVA22U64-NEXT:    slli a4, a4, 32
 ; RVA22U64-NEXT:    slli a0, a0, 40
 ; RVA22U64-NEXT:    or a0, a0, a4
 ; RVA22U64-NEXT:    slli a5, a5, 48
-; RVA22U64-NEXT:    slli a2, a2, 56
-; RVA22U64-NEXT:    or a2, a2, a5
-; RVA22U64-NEXT:    or a1, a1, a3
-; RVA22U64-NEXT:    or a0, a0, a2
+; RVA22U64-NEXT:    slli a1, a1, 56
+; RVA22U64-NEXT:    or a1, a1, a5
+; RVA22U64-NEXT:    or a2, a6, a2
 ; RVA22U64-NEXT:    or a0, a0, a1
+; RVA22U64-NEXT:    or a0, a0, a2
 ; RVA22U64-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; RVA22U64-NEXT:    vmv.v.i v8, 0
 ; RVA22U64-NEXT:    vslide1down.vx v8, v8, a0
@@ -2313,25 +2313,25 @@ define <16 x i8> @buildvec_v16i8_undef_high_half(ptr %p) {
 ; RV32VB-LABEL: buildvec_v16i8_undef_high_half:
 ; RV32VB:       # %bb.0:
 ; RV32VB-NEXT:    lbu a1, 1(a0)
-; RV32VB-NEXT:    lbu a2, 22(a0)
-; RV32VB-NEXT:    lbu a3, 31(a0)
-; RV32VB-NEXT:    lbu a4, 0(a0)
+; RV32VB-NEXT:    lbu a2, 0(a0)
+; RV32VB-NEXT:    lbu a3, 22(a0)
+; RV32VB-NEXT:    lbu a4, 31(a0)
 ; RV32VB-NEXT:    slli a1, a1, 8
-; RV32VB-NEXT:    slli a2, a2, 16
-; RV32VB-NEXT:    slli a3, a3, 24
-; RV32VB-NEXT:    or a1, a4, a1
-; RV32VB-NEXT:    lbu a4, 44(a0)
+; RV32VB-NEXT:    or a1, a2, a1
+; RV32VB-NEXT:    lbu a2, 44(a0)
 ; RV32VB-NEXT:    lbu a5, 55(a0)
-; RV32VB-NEXT:    or a2, a3, a2
-; RV32VB-NEXT:    lbu a3, 623(a0)
+; RV32VB-NEXT:    slli a3, a3, 16
+; RV32VB-NEXT:    slli a4, a4, 24
+; RV32VB-NEXT:    or a3, a4, a3
+; RV32VB-NEXT:    lbu a4, 623(a0)
 ; RV32VB-NEXT:    lbu a0, 75(a0)
 ; RV32VB-NEXT:    slli a5, a5, 8
-; RV32VB-NEXT:    or a4, a4, a5
-; RV32VB-NEXT:    slli a3, a3, 16
+; RV32VB-NEXT:    or a2, a2, a5
+; RV32VB-NEXT:    slli a4, a4, 16
 ; RV32VB-NEXT:    slli a0, a0, 24
-; RV32VB-NEXT:    or a0, a0, a3
-; RV32VB-NEXT:    or a1, a1, a2
-; RV32VB-NEXT:    or a0, a4, a0
+; RV32VB-NEXT:    or a0, a0, a4
+; RV32VB-NEXT:    or a1, a1, a3
+; RV32VB-NEXT:    or a0, a2, a0
 ; RV32VB-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
 ; RV32VB-NEXT:    vmv.v.x v8, a1
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a0
@@ -2389,26 +2389,26 @@ define <16 x i8> @buildvec_v16i8_undef_high_half(ptr %p) {
 ; RVA22U64-LABEL: buildvec_v16i8_undef_high_half:
 ; RVA22U64:       # %bb.0:
 ; RVA22U64-NEXT:    lbu a1, 1(a0)
-; RVA22U64-NEXT:    lbu a2, 22(a0)
-; RVA22U64-NEXT:    lbu a3, 31(a0)
-; RVA22U64-NEXT:    lbu a4, 0(a0)
+; RVA22U64-NEXT:    lbu a2, 0(a0)
+; RVA22U64-NEXT:    lbu a3, 22(a0)
+; RVA22U64-NEXT:    lbu a4, 31(a0)
 ; RVA22U64-NEXT:    slli a1, a1, 8
-; RVA22U64-NEXT:    slli a2, a2, 16
-; RVA22U64-NEXT:    slli a3, a3, 24
-; RVA22U64-NEXT:    or a1, a1, a4
-; RVA22U64-NEXT:    or a2, a2, a3
-; RVA22U64-NEXT:    lbu a3, 44(a0)
-; RVA22U64-NEXT:    lbu a4, 55(a0)
-; RVA22U64-NEXT:    lbu a5, 623(a0)
-; RVA22U64-NEXT:    lbu a0, 75(a0)
-; RVA22U64-NEXT:    slli a3, a3, 32
-; RVA22U64-NEXT:    slli a4, a4, 40
+; RVA22U64-NEXT:    or a1, a1, a2
+; RVA22U64-NEXT:    lbu a2, 44(a0)
+; RVA22U64-NEXT:    lbu a5, 55(a0)
+; RVA22U64-NEXT:    slli a3, a3, 16
+; RVA22U64-NEXT:    slli a4, a4, 24
 ; RVA22U64-NEXT:    or a3, a3, a4
-; RVA22U64-NEXT:    slli a5, a5, 48
+; RVA22U64-NEXT:    lbu a4, 623(a0)
+; RVA22U64-NEXT:    lbu a0, 75(a0)
+; RVA22U64-NEXT:    slli a2, a2, 32
+; RVA22U64-NEXT:    slli a5, a5, 40
+; RVA22U64-NEXT:    or a2, a2, a5
+; RVA22U64-NEXT:    slli a4, a4, 48
 ; RVA22U64-NEXT:    slli a0, a0, 56
-; RVA22U64-NEXT:    or a0, a0, a5
-; RVA22U64-NEXT:    or a1, a1, a2
-; RVA22U64-NEXT:    or a0, a0, a3
+; RVA22U64-NEXT:    or a0, a0, a4
+; RVA22U64-NEXT:    or a1, a1, a3
+; RVA22U64-NEXT:    or a0, a0, a2
 ; RVA22U64-NEXT:    or a0, a0, a1
 ; RVA22U64-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; RVA22U64-NEXT:    vmv.v.x v8, a0
@@ -2522,28 +2522,28 @@ define <16 x i8> @buildvec_v16i8_undef_edges(ptr %p) {
 ; RV32VB:       # %bb.0:
 ; RV32VB-NEXT:    lbu a1, 623(a0)
 ; RV32VB-NEXT:    lbu a2, 55(a0)
-; RV32VB-NEXT:    lbu a3, 75(a0)
-; RV32VB-NEXT:    lbu a4, 31(a0)
+; RV32VB-NEXT:    lbu a3, 31(a0)
+; RV32VB-NEXT:    lbu a4, 75(a0)
 ; RV32VB-NEXT:    lbu a5, 44(a0)
 ; RV32VB-NEXT:    slli a2, a2, 8
 ; RV32VB-NEXT:    slli a1, a1, 16
-; RV32VB-NEXT:    slli a3, a3, 24
+; RV32VB-NEXT:    slli a4, a4, 24
 ; RV32VB-NEXT:    or a2, a5, a2
+; RV32VB-NEXT:    or a1, a4, a1
+; RV32VB-NEXT:    lbu a4, 93(a0)
 ; RV32VB-NEXT:    lbu a5, 82(a0)
-; RV32VB-NEXT:    lbu a6, 93(a0)
-; RV32VB-NEXT:    or a1, a3, a1
-; RV32VB-NEXT:    lbu a3, 105(a0)
+; RV32VB-NEXT:    lbu a6, 105(a0)
 ; RV32VB-NEXT:    lbu a0, 161(a0)
-; RV32VB-NEXT:    slli a6, a6, 8
-; RV32VB-NEXT:    or a5, a5, a6
-; RV32VB-NEXT:    slli a3, a3, 16
+; RV32VB-NEXT:    slli a4, a4, 8
+; RV32VB-NEXT:    or a4, a5, a4
+; RV32VB-NEXT:    slli a6, a6, 16
 ; RV32VB-NEXT:    slli a0, a0, 24
-; RV32VB-NEXT:    or a0, a0, a3
-; RV32VB-NEXT:    slli a4, a4, 24
+; RV32VB-NEXT:    or a0, a0, a6
+; RV32VB-NEXT:    slli a3, a3, 24
 ; RV32VB-NEXT:    or a1, a2, a1
-; RV32VB-NEXT:    or a0, a5, a0
+; RV32VB-NEXT:    or a0, a4, a0
 ; RV32VB-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; RV32VB-NEXT:    vmv.v.x v8, a4
+; RV32VB-NEXT:    vmv.v.x v8, a3
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a1
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, a0
 ; RV32VB-NEXT:    vslide1down.vx v8, v8, zero
@@ -2607,32 +2607,32 @@ define <16 x i8> @buildvec_v16i8_undef_edges(ptr %p) {
 ;
 ; RVA22U64-LABEL: buildvec_v16i8_undef_edges:
 ; RVA22U64:       # %bb.0:
-; RVA22U64-NEXT:    lbu a6, 31(a0)
+; RVA22U64-NEXT:    lbu a1, 623(a0)
 ; RVA22U64-NEXT:    lbu a2, 44(a0)
 ; RVA22U64-NEXT:    lbu a3, 55(a0)
-; RVA22U64-NEXT:    lbu a4, 623(a0)
+; RVA22U64-NEXT:    lbu a6, 31(a0)
 ; RVA22U64-NEXT:    lbu a5, 75(a0)
 ; RVA22U64-NEXT:    slli a2, a2, 32
 ; RVA22U64-NEXT:    slli a3, a3, 40
-; RVA22U64-NEXT:    slli a4, a4, 48
+; RVA22U64-NEXT:    slli a1, a1, 48
 ; RVA22U64-NEXT:    slli a5, a5, 56
 ; RVA22U64-NEXT:    or a2, a2, a3
-; RVA22U64-NEXT:    lbu a3, 82(a0)
-; RVA22U64-NEXT:    lbu a1, 93(a0)
-; RVA22U64-NEXT:    or a4, a4, a5
-; RVA22U64-NEXT:    lbu a5, 105(a0)
+; RVA22U64-NEXT:    or a1, a1, a5
+; RVA22U64-NEXT:    lbu a3, 93(a0)
+; RVA22U64-NEXT:    lbu a5, 82(a0)
+; RVA22U64-NEXT:    lbu a4, 105(a0)
 ; RVA22U64-NEXT:    lbu a0, 161(a0)
-; RVA22U64-NEXT:    slli a1, a1, 8
-; RVA22U64-NEXT:    or a1, a1, a3
-; RVA22U64-NEXT:    slli a5, a5, 16
+; RVA22U64-NEXT:    slli a3, a3, 8
+; RVA22U64-NEXT:    or a3, a3, a5
+; RVA22U64-NEXT:    slli a4, a4, 16
 ; RVA22U64-NEXT:    slli a0, a0, 24
-; RVA22U64-NEXT:    or a0, a0, a5
+; RVA22U64-NEXT:    or a0, a0, a4
 ; RVA22U64-NEXT:    slli a6, a6, 24
-; RVA22U64-NEXT:    or a2, a2, a4
-; RVA22U64-NEXT:    add.uw a2, a6, a2
-; RVA22U64-NEXT:    or a0, a0, a1
+; RVA22U64-NEXT:    or a1, a1, a2
+; RVA22U64-NEXT:    add.uw a1, a6, a1
+; RVA22U64-NEXT:    or a0, a0, a3
 ; RVA22U64-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
-; RVA22U64-NEXT:    vmv.v.x v8, a2
+; RVA22U64-NEXT:    vmv.v.x v8, a1
 ; RVA22U64-NEXT:    vslide1down.vx v8, v8, a0
 ; RVA22U64-NEXT:    ret
 ;
@@ -2794,26 +2794,26 @@ define <16 x i8> @buildvec_v16i8_loads_undef_scattered(ptr %p) {
 ; RV32VB-PACK-NEXT:    lbu a2, 1(a0)
 ; RV32VB-PACK-NEXT:    lbu a3, 44(a0)
 ; RV32VB-PACK-NEXT:    lbu a4, 55(a0)
-; RV32VB-PACK-NEXT:    lbu a5, 75(a0)
-; RV32VB-PACK-NEXT:    lbu a6, 82(a0)
-; RV32VB-PACK-NEXT:    lbu a7, 93(a0)
+; RV32VB-PACK-NEXT:    lbu a5, 82(a0)
+; RV32VB-PACK-NEXT:    lbu a6, 93(a0)
 ; RV32VB-PACK-NEXT:    packh a1, a1, a2
 ; RV32VB-PACK-NEXT:    lbu a2, 144(a0)
-; RV32VB-PACK-NEXT:    lbu t0, 154(a0)
+; RV32VB-PACK-NEXT:    lbu a7, 154(a0)
 ; RV32VB-PACK-NEXT:    packh a3, a3, a4
+; RV32VB-PACK-NEXT:    lbu a4, 75(a0)
 ; RV32VB-PACK-NEXT:    lbu a0, 124(a0)
-; RV32VB-PACK-NEXT:    packh a4, a6, a7
-; RV32VB-PACK-NEXT:    packh a2, a2, t0
-; RV32VB-PACK-NEXT:    packh a5, a0, a5
-; RV32VB-PACK-NEXT:    pack a3, a3, a5
-; RV32VB-PACK-NEXT:    packh a5, a0, a0
+; RV32VB-PACK-NEXT:    packh a5, a5, a6
+; RV32VB-PACK-NEXT:    packh a2, a2, a7
+; RV32VB-PACK-NEXT:    packh a4, a0, a4
+; RV32VB-PACK-NEXT:    pack a3, a3, a4
+; RV32VB-PACK-NEXT:    packh a4, a0, a0
 ; RV32VB-PACK-NEXT:    packh a0, a0, a0
 ; RV32VB-PACK-NEXT:    pack a0, a0, a2
-; RV32VB-PACK-NEXT:    pack a1, a1, a5
+; RV32VB-PACK-NEXT:    pack a1, a1, a4
 ; RV32VB-PACK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
 ; RV32VB-PACK-NEXT:    vmv.v.x v8, a1
 ; RV32VB-PACK-NEXT:    vslide1down.vx v8, v8, a3
-; RV32VB-PACK-NEXT:    pack a1, a4, a5
+; RV32VB-PACK-NEXT:    pack a1, a5, a4
 ; RV32VB-PACK-NEXT:    vslide1down.vx v8, v8, a1
 ; RV32VB-PACK-NEXT:    vslide1down.vx v8, v8, a0
 ; RV32VB-PACK-NEXT:    ret
@@ -2888,23 +2888,23 @@ define <16 x i8> @buildvec_v16i8_loads_undef_scattered(ptr %p) {
 ; RVA22U64-PACK:       # %bb.0:
 ; RVA22U64-PACK-NEXT:    lbu a1, 0(a0)
 ; RVA22U64-PACK-NEXT:    lbu a2, 1(a0)
-; RVA22U64-PACK-NEXT:    lbu a7, 44(a0)
-; RVA22U64-PACK-NEXT:    lbu t0, 55(a0)
-; RVA22U64-PACK-NEXT:    lbu a6, 75(a0)
-; RVA22U64-PACK-NEXT:    lbu a5, 82(a0)
+; RVA22U64-PACK-NEXT:    lbu a6, 44(a0)
+; RVA22U64-PACK-NEXT:    lbu a7, 55(a0)
+; RVA22U64-PACK-NEXT:    lbu t1, 82(a0)
 ; RVA22U64-PACK-NEXT:    lbu a3, 93(a0)
-; RVA22U64-PACK-NEXT:    packh t1, a1, a2
+; RVA22U64-PACK-NEXT:    packh t0, a1, a2
 ; RVA22U64-PACK-NEXT:    lbu a2, 144(a0)
 ; RVA22U64-PACK-NEXT:    lbu a4, 154(a0)
-; RVA22U64-PACK-NEXT:    packh a1, a7, t0
+; RVA22U64-PACK-NEXT:    packh a1, a6, a7
+; RVA22U64-PACK-NEXT:    lbu a5, 75(a0)
 ; RVA22U64-PACK-NEXT:    lbu a0, 124(a0)
-; RVA22U64-PACK-NEXT:    packh a3, a5, a3
+; RVA22U64-PACK-NEXT:    packh a3, t1, a3
 ; RVA22U64-PACK-NEXT:    packh a2, a2, a4
-; RVA22U64-PACK-NEXT:    packh a4, a0, a6
+; RVA22U64-PACK-NEXT:    packh a4, a0, a5
 ; RVA22U64-PACK-NEXT:    packw a1, a1, a4
 ; RVA22U64-PACK-NEXT:    packh a4, a0, a0
 ; RVA22U64-PACK-NEXT:    packh a0, a0, a0
-; RVA22U64-PACK-NEXT:    packw a5, t1, a4
+; RVA22U64-PACK-NEXT:    packw a5, t0, a4
 ; RVA22U64-PACK-NEXT:    packw a0, a0, a2
 ; RVA22U64-PACK-NEXT:    packw a2, a3, a4
 ; RVA22U64-PACK-NEXT:    pack a1, a5, a1
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
index 76eca8e034303..db339755e73c5 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
@@ -14204,13 +14204,13 @@ define <8 x i16> @mgather_strided_unaligned(ptr %base) {
 ; RV64ZVE32F-NEXT:    slli t2, t2, 8
 ; RV64ZVE32F-NEXT:    or a6, t0, a7
 ; RV64ZVE32F-NEXT:    or a2, a4, a2
-; RV64ZVE32F-NEXT:    lbu a4, 24(a0)
+; RV64ZVE32F-NEXT:    or a4, t2, t1
 ; RV64ZVE32F-NEXT:    lbu a7, 25(a0)
-; RV64ZVE32F-NEXT:    or t0, t2, t1
+; RV64ZVE32F-NEXT:    lbu t0, 24(a0)
 ; RV64ZVE32F-NEXT:    lbu t1, 28(a0)
 ; RV64ZVE32F-NEXT:    lbu a0, 29(a0)
 ; RV64ZVE32F-NEXT:    slli a7, a7, 8
-; RV64ZVE32F-NEXT:    or a4, a7, a4
+; RV64ZVE32F-NEXT:    or a7, a7, t0
 ; RV64ZVE32F-NEXT:    vsetivli zero, 8, e16, m1, ta, mu
 ; RV64ZVE32F-NEXT:    vmv.v.i v0, 15
 ; RV64ZVE32F-NEXT:    slli a0, a0, 8
@@ -14218,9 +14218,9 @@ define <8 x i16> @mgather_strided_unaligned(ptr %base) {
 ; RV64ZVE32F-NEXT:    vmv.v.x v8, a1
 ; RV64ZVE32F-NEXT:    vmv.v.x v9, a2
 ; RV64ZVE32F-NEXT:    vslide1down.vx v8, v8, a3
-; RV64ZVE32F-NEXT:    vslide1down.vx v9, v9, t0
-; RV64ZVE32F-NEXT:    vslide1down.vx v8, v8, a5
 ; RV64ZVE32F-NEXT:    vslide1down.vx v9, v9, a4
+; RV64ZVE32F-NEXT:    vslide1down.vx v8, v8, a5
+; RV64ZVE32F-NEXT:    vslide1down.vx v9, v9, a7
 ; RV64ZVE32F-NEXT:    vslide1down.vx v10, v8, a6
 ; RV64ZVE32F-NEXT:    vslide1down.vx v8, v9, a0
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v10, 4, v0.t
diff --git a/llvm/test/CodeGen/RISCV/rvv/pr125306.ll b/llvm/test/CodeGen/RISCV/rvv/pr125306.ll
index 9400c381bc87c..eee57f489cb10 100644
--- a/llvm/test/CodeGen/RISCV/rvv/pr125306.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/pr125306.ll
@@ -43,24 +43,24 @@ define <2 x i32> @main(ptr %0) {
 ; CHECK-NEXT:    sh zero, -1710(a5)
 ; CHECK-NEXT:    sh zero, -784(a5)
 ; CHECK-NEXT:    sh zero, 142(a5)
-; CHECK-NEXT:    lw a5, -304(a1)
+; CHECK-NEXT:    lw a5, 1244(a1)
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; CHECK-NEXT:    vadd.vi v9, v11, -1
 ; CHECK-NEXT:    vse32.v v10, (a3)
+; CHECK-NEXT:    lw a3, -188(a1)
 ; CHECK-NEXT:    sh zero, 0(a0)
-; CHECK-NEXT:    lw a0, -188(a1)
+; CHECK-NEXT:    lw a0, -304(a1)
+; CHECK-NEXT:    vadd.vi v9, v11, -1
 ; CHECK-NEXT:    vse32.v v10, (a2)
 ; CHECK-NEXT:    lw a2, -188(a1)
+; CHECK-NEXT:    vmv.v.x v8, a3
 ; CHECK-NEXT:    lw a3, 1244(a1)
-; CHECK-NEXT:    vmv.v.x v8, a0
-; CHECK-NEXT:    lw a0, 1244(a1)
 ; CHECK-NEXT:    lw a1, -304(a1)
-; CHECK-NEXT:    vmv.v.x v10, a3
-; CHECK-NEXT:    vmv.v.x v11, a5
+; CHECK-NEXT:    vmv.v.x v10, a5
+; CHECK-NEXT:    vmv.v.x v11, a0
 ; CHECK-NEXT:    vslide1down.vx v8, v8, zero
 ; CHECK-NEXT:    vslide1down.vx v10, v10, zero
 ; CHECK-NEXT:    vmin.vv v8, v10, v8
-; CHECK-NEXT:    vmv.s.x v10, a0
+; CHECK-NEXT:    vmv.s.x v10, a3
 ; CHECK-NEXT:    vslide1down.vx v11, v11, zero
 ; CHECK-NEXT:    vmin.vx v10, v10, a2
 ; CHECK-NEXT:    vmin.vx v10, v10, a1
diff --git a/llvm/test/CodeGen/RISCV/scmp.ll b/llvm/test/CodeGen/RISCV/scmp.ll
index a212714db53e0..56c876a2409d2 100644
--- a/llvm/test/CodeGen/RISCV/scmp.ll
+++ b/llvm/test/CodeGen/RISCV/scmp.ll
@@ -89,8 +89,8 @@ define i8 @scmp.8.128(i128 %x, i128 %y) nounwind {
 ; RV32I-NEXT:    lw a2, 4(a1)
 ; RV32I-NEXT:    lw a4, 8(a1)
 ; RV32I-NEXT:    lw a5, 12(a1)
-; RV32I-NEXT:    lw a6, 12(a0)
 ; RV32I-NEXT:    lw a3, 4(a0)
+; RV32I-NEXT:    lw a6, 12(a0)
 ; RV32I-NEXT:    lw a7, 8(a0)
 ; RV32I-NEXT:    beq a6, a5, .LBB4_2
 ; RV32I-NEXT:  # %bb.1:
diff --git a/llvm/test/CodeGen/RISCV/srem-vector-lkk.ll b/llvm/test/CodeGen/RISCV/srem-vector-lkk.ll
index cf65d4e0cf805..e80fa90ce2e69 100644
--- a/llvm/test/CodeGen/RISCV/srem-vector-lkk.ll
+++ b/llvm/test/CodeGen/RISCV/srem-vector-lkk.ll
@@ -562,49 +562,49 @@ define <4 x i16> @combine_srem_sdiv(<4 x i16> %x) nounwind {
 ;
 ; RV64IM-LABEL: combine_srem_sdiv:
 ; RV64IM:       # %bb.0:
-; RV64IM-NEXT:    lh a2, 16(a1)
-; RV64IM-NEXT:    lh a3, 24(a1)
-; RV64IM-NEXT:    lui a4, %hi(.LCPI2_0)
-; RV64IM-NEXT:    ld a4, %lo(.LCPI2_0)(a4)
+; RV64IM-NEXT:    lui a2, %hi(.LCPI2_0)
+; RV64IM-NEXT:    ld a2, %lo(.LCPI2_0)(a2)
+; RV64IM-NEXT:    lh a3, 16(a1)
+; RV64IM-NEXT:    lh a4, 24(a1)
 ; RV64IM-NEXT:    lh a5, 0(a1)
 ; RV64IM-NEXT:    lh a1, 8(a1)
 ; RV64IM-NEXT:    li a6, 95
-; RV64IM-NEXT:    mulh a7, a3, a4
-; RV64IM-NEXT:    mulh t0, a2, a4
-; RV64IM-NEXT:    mulh t1, a1, a4
-; RV64IM-NEXT:    mulh a4, a5, a4
-; RV64IM-NEXT:    add a7, a7, a3
-; RV64IM-NEXT:    add t0, t0, a2
+; RV64IM-NEXT:    mulh a7, a4, a2
+; RV64IM-NEXT:    mulh t0, a3, a2
+; RV64IM-NEXT:    mulh t1, a1, a2
+; RV64IM-NEXT:    mulh a2, a5, a2
+; RV64IM-NEXT:    add a7, a7, a4
+; RV64IM-NEXT:    add t0, t0, a3
 ; RV64IM-NEXT:    add t1, t1, a1
-; RV64IM-NEXT:    add a4, a4, a5
+; RV64IM-NEXT:    add a2, a2, a5
 ; RV64IM-NEXT:    srli t2, a7, 63
 ; RV64IM-NEXT:    srai a7, a7, 6
 ; RV64IM-NEXT:    srli t3, t0, 63
 ; RV64IM-NEXT:    srai t0, t0, 6
 ; RV64IM-NEXT:    srli t4, t1, 63
 ; RV64IM-NEXT:    srai t1, t1, 6
-; RV64IM-NEXT:    srli t5, a4, 63
-; RV64IM-NEXT:    srai a4, a4, 6
+; RV64IM-NEXT:    srli t5, a2, 63
+; RV64IM-NEXT:    srai a2, a2, 6
 ; RV64IM-NEXT:    add a7, a7, t2
 ; RV64IM-NEXT:    add t0, t0, t3
 ; RV64IM-NEXT:    add t1, t1, t4
-; RV64IM-NEXT:    add a4, a4, t5
+; RV64IM-NEXT:    add a2, a2, t5
 ; RV64IM-NEXT:    mul t2, a7, a6
 ; RV64IM-NEXT:    mul t3, t0, a6
 ; RV64IM-NEXT:    mul t4, t1, a6
-; RV64IM-NEXT:    mul a6, a4, a6
-; RV64IM-NEXT:    add a4, a5, a4
+; RV64IM-NEXT:    mul a6, a2, a6
+; RV64IM-NEXT:    add a2, a5, a2
 ; RV64IM-NEXT:    add a1, a1, t1
-; RV64IM-NEXT:    add a2, a2, t0
-; RV64IM-NEXT:    add a3, a3, a7
-; RV64IM-NEXT:    subw a4, a4, a6
+; RV64IM-NEXT:    add a3, a3, t0
+; RV64IM-NEXT:    add a4, a4, a7
+; RV64IM-NEXT:    subw a2, a2, a6
 ; RV64IM-NEXT:    subw a1, a1, t4
-; RV64IM-NEXT:    subw a2, a2, t3
-; RV64IM-NEXT:    subw a3, a3, t2
-; RV64IM-NEXT:    sh a4, 0(a0)
+; RV64IM-NEXT:    subw a3, a3, t3
+; RV64IM-NEXT:    subw a4, a4, t2
+; RV64IM-NEXT:    sh a2, 0(a0)
 ; RV64IM-NEXT:    sh a1, 2(a0)
-; RV64IM-NEXT:    sh a2, 4(a0)
-; RV64IM-NEXT:    sh a3, 6(a0)
+; RV64IM-NEXT:    sh a3, 4(a0)
+; RV64IM-NEXT:    sh a4, 6(a0)
 ; RV64IM-NEXT:    ret
   %1 = srem <4 x i16> %x, <i16 95, i16 95, i16 95, i16 95>
   %2 = sdiv <4 x i16> %x, <i16 95, i16 95, i16 95, i16 95>
diff --git a/llvm/test/CodeGen/RISCV/ucmp.ll b/llvm/test/CodeGen/RISCV/ucmp.ll
index 50da56fbc5951..0a400b1c04a3f 100644
--- a/llvm/test/CodeGen/RISCV/ucmp.ll
+++ b/llvm/test/CodeGen/RISCV/ucmp.ll
@@ -89,8 +89,8 @@ define i8 @ucmp.8.128(i128 %x, i128 %y) nounwind {
 ; RV32I-NEXT:    lw a2, 4(a1)
 ; RV32I-NEXT:    lw a4, 8(a1)
 ; RV32I-NEXT:    lw a5, 12(a1)
-; RV32I-NEXT:    lw a6, 12(a0)
 ; RV32I-NEXT:    lw a3, 4(a0)
+; RV32I-NEXT:    lw a6, 12(a0)
 ; RV32I-NEXT:    lw a7, 8(a0)
 ; RV32I-NEXT:    beq a6, a5, .LBB4_2
 ; RV32I-NEXT:  # %bb.1:
diff --git a/llvm/test/CodeGen/RISCV/unaligned-load-store.ll b/llvm/test/CodeGen/RISCV/unaligned-load-store.ll
index 1cdfaa5c4154b..c9c49e8f7f532 100644
--- a/llvm/test/CodeGen/RISCV/unaligned-load-store.ll
+++ b/llvm/test/CodeGen/RISCV/unaligned-load-store.ll
@@ -140,18 +140,18 @@ define i64 @load_i64(ptr %p) {
 ; RV32I-NEXT:    slli a2, a2, 16
 ; RV32I-NEXT:    slli a3, a3, 24
 ; RV32I-NEXT:    or a1, a1, a4
-; RV32I-NEXT:    lbu a4, 4(a0)
-; RV32I-NEXT:    lbu a5, 5(a0)
 ; RV32I-NEXT:    or a2, a3, a2
-; RV32I-NEXT:    lbu a3, 6(a0)
+; RV32I-NEXT:    lbu a3, 5(a0)
+; RV32I-NEXT:    lbu a4, 4(a0)
+; RV32I-NEXT:    lbu a5, 6(a0)
 ; RV32I-NEXT:    lbu a0, 7(a0)
-; RV32I-NEXT:    slli a5, a5, 8
-; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    slli a3, a3, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a4
+; RV32I-NEXT:    slli a5, a5, 16
 ; RV32I-NEXT:    slli a0, a0, 24
-; RV32I-NEXT:    or a3, a0, a3
+; RV32I-NEXT:    or a5, a0, a5
 ; RV32I-NEXT:    or a0, a2, a1
-; RV32I-NEXT:    or a1, a3, a4
+; RV32I-NEXT:    or a1, a5, a3
 ; RV32I-NEXT:    ret
 ;
 ; RV64I-LABEL: load_i64:
@@ -164,18 +164,18 @@ define i64 @load_i64(ptr %p) {
 ; RV64I-NEXT:    slli a2, a2, 16
 ; RV64I-NEXT:    slli a3, a3, 24
 ; RV64I-NEXT:    or a1, a1, a4
-; RV64I-NEXT:    lbu a4, 4(a0)
-; RV64I-NEXT:    lbu a5, 5(a0)
 ; RV64I-NEXT:    or a2, a3, a2
-; RV64I-NEXT:    lbu a3, 6(a0)
+; RV64I-NEXT:    lbu a3, 5(a0)
+; RV64I-NEXT:    lbu a4, 4(a0)
+; RV64I-NEXT:    lbu a5, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli a5, a5, 8
-; RV64I-NEXT:    or a4, a5, a4
-; RV64I-NEXT:    slli a3, a3, 16
+; RV64I-NEXT:    slli a3, a3, 8
+; RV64I-NEXT:    or a3, a3, a4
+; RV64I-NEXT:    slli a5, a5, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, a3
+; RV64I-NEXT:    or a0, a0, a5
 ; RV64I-NEXT:    or a1, a2, a1
-; RV64I-NEXT:    or a0, a0, a4
+; RV64I-NEXT:    or a0, a0, a3
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a1
 ; RV64I-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll b/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll
index 988856ca70923..fd7ef7efe3efd 100644
--- a/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll
+++ b/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll
@@ -489,33 +489,33 @@ define <4 x i16> @combine_urem_udiv(<4 x i16> %x) nounwind {
 ;
 ; RV64IM-LABEL: combine_urem_udiv:
 ; RV64IM:       # %bb.0:
-; RV64IM-NEXT:    lhu a2, 16(a1)
-; RV64IM-NEXT:    lhu a3, 24(a1)
-; RV64IM-NEXT:    lui a4, %hi(.LCPI2_0)
-; RV64IM-NEXT:    ld a4, %lo(.LCPI2_0)(a4)
+; RV64IM-NEXT:    lui a2, %hi(.LCPI2_0)
+; RV64IM-NEXT:    ld a2, %lo(.LCPI2_0)(a2)
+; RV64IM-NEXT:    lhu a3, 16(a1)
+; RV64IM-NEXT:    lhu a4, 24(a1)
 ; RV64IM-NEXT:    lhu a5, 0(a1)
 ; RV64IM-NEXT:    lhu a1, 8(a1)
 ; RV64IM-NEXT:    li a6, 95
-; RV64IM-NEXT:    mulhu a7, a3, a4
-; RV64IM-NEXT:    mulhu t0, a2, a4
-; RV64IM-NEXT:    mulhu t1, a1, a4
-; RV64IM-NEXT:    mulhu a4, a5, a4
+; RV64IM-NEXT:    mulhu a7, a4, a2
+; RV64IM-NEXT:    mulhu t0, a3, a2
+; RV64IM-NEXT:    mulhu t1, a1, a2
+; RV64IM-NEXT:    mulhu a2, a5, a2
 ; RV64IM-NEXT:    mul t2, a7, a6
 ; RV64IM-NEXT:    mul t3, t0, a6
 ; RV64IM-NEXT:    mul t4, t1, a6
-; RV64IM-NEXT:    mul a6, a4, a6
-; RV64IM-NEXT:    add a4, a5, a4
+; RV64IM-NEXT:    mul a6, a2, a6
+; RV64IM-NEXT:    add a2, a5, a2
 ; RV64IM-NEXT:    add a1, a1, t1
-; RV64IM-NEXT:    add a2, a2, t0
-; RV64IM-NEXT:    add a3, a3, a7
-; RV64IM-NEXT:    subw a4, a4, a6
+; RV64IM-NEXT:    add a3, a3, t0
+; RV64IM-NEXT:    add a4, a4, a7
+; RV64IM-NEXT:    subw a2, a2, a6
 ; RV64IM-NEXT:    subw a1, a1, t4
-; RV64IM-NEXT:    subw a2, a2, t3
-; RV64IM-NEXT:    subw a3, a3, t2
-; RV64IM-NEXT:    sh a4, 0(a0)
+; RV64IM-NEXT:    subw a3, a3, t3
+; RV64IM-NEXT:    subw a4, a4, t2
+; RV64IM-NEXT:    sh a2, 0(a0)
 ; RV64IM-NEXT:    sh a1, 2(a0)
-; RV64IM-NEXT:    sh a2, 4(a0)
-; RV64IM-NEXT:    sh a3, 6(a0)
+; RV64IM-NEXT:    sh a3, 4(a0)
+; RV64IM-NEXT:    sh a4, 6(a0)
 ; RV64IM-NEXT:    ret
   %1 = urem <4 x i16> %x, <i16 95, i16 95, i16 95, i16 95>
   %2 = udiv <4 x i16> %x, <i16 95, i16 95, i16 95, i16 95>
diff --git a/llvm/test/CodeGen/RISCV/vararg.ll b/llvm/test/CodeGen/RISCV/vararg.ll
index 895d84b38be32..de7b256401842 100644
--- a/llvm/test/CodeGen/RISCV/vararg.ll
+++ b/llvm/test/CodeGen/RISCV/vararg.ll
@@ -209,9 +209,9 @@ define i32 @va1(ptr %fmt, ...) {
 ; LP64E-FPELIM:       # %bb.0:
 ; LP64E-FPELIM-NEXT:    addi sp, sp, -56
 ; LP64E-FPELIM-NEXT:    .cfi_def_cfa_offset 56
+; LP64E-FPELIM-NEXT:    sd a1, 16(sp)
 ; LP64E-FPELIM-NEXT:    addi a0, sp, 20
 ; LP64E-FPELIM-NEXT:    sd a0, 0(sp)
-; LP64E-FPELIM-NEXT:    sd a1, 16(sp)
 ; LP64E-FPELIM-NEXT:    lw a0, 16(sp)
 ; LP64E-FPELIM-NEXT:    sd a5, 48(sp)
 ; LP64E-FPELIM-NEXT:    sd a2, 24(sp)
@@ -231,9 +231,9 @@ define i32 @va1(ptr %fmt, ...) {
 ; LP64E-WITHFP-NEXT:    .cfi_offset s0, -64
 ; LP64E-WITHFP-NEXT:    addi s0, sp, 24
 ; LP64E-WITHFP-NEXT:    .cfi_def_cfa s0, 48
+; LP64E-WITHFP-NEXT:    sd a1, 8(s0)
 ; LP64E-WITHFP-NEXT:    addi a0, s0, 12
 ; LP64E-WITHFP-NEXT:    sd a0, -24(s0)
-; LP64E-WITHFP-NEXT:    sd a1, 8(s0)
 ; LP64E-WITHFP-NEXT:    lw a0, 8(s0)
 ; LP64E-WITHFP-NEXT:    sd a5, 40(s0)
 ; LP64E-WITHFP-NEXT:    sd a2, 16(s0)
@@ -3070,12 +3070,12 @@ define i32 @va_large_stack(ptr %fmt, ...) {
 ; LP64E-FPELIM-NEXT:    sub sp, sp, a0
 ; LP64E-FPELIM-NEXT:    .cfi_def_cfa_offset 100000064
 ; LP64E-FPELIM-NEXT:    lui a0, 24414
-; LP64E-FPELIM-NEXT:    addiw a0, a0, 284
 ; LP64E-FPELIM-NEXT:    add a0, sp, a0
-; LP64E-FPELIM-NEXT:    sd a0, 8(sp)
+; LP64E-FPELIM-NEXT:    sd a1, 280(a0)
 ; LP64E-FPELIM-NEXT:    lui a0, 24414
+; LP64E-FPELIM-NEXT:    addiw a0, a0, 284
 ; LP64E-FPELIM-NEXT:    add a0, sp, a0
-; LP64E-FPELIM-NEXT:    sd a1, 280(a0)
+; LP64E-FPELIM-NEXT:    sd a0, 8(sp)
 ; LP64E-FPELIM-NEXT:    lui a0, 24414
 ; LP64E-FPELIM-NEXT:    add a0, sp, a0
 ; LP64E-FPELIM-NEXT:    lw a0, 280(a0)
@@ -3110,11 +3110,11 @@ define i32 @va_large_stack(ptr %fmt, ...) {
 ; LP64E-WITHFP-NEXT:    lui a0, 24414
 ; LP64E-WITHFP-NEXT:    addiw a0, a0, -1704
 ; LP64E-WITHFP-NEXT:    sub sp, sp, a0
-; LP64E-WITHFP-NEXT:    addi a0, s0, 12
-; LP64E-WITHFP-NEXT:    lui a6, 24414
-; LP64E-WITHFP-NEXT:    sub a6, s0, a6
-; LP64E-WITHFP-NEXT:    sd a0, -288(a6)
 ; LP64E-WITHFP-NEXT:    sd a1, 8(s0)
+; LP64E-WITHFP-NEXT:    addi a0, s0, 12
+; LP64E-WITHFP-NEXT:    lui a1, 24414
+; LP64E-WITHFP-NEXT:    sub a1, s0, a1
+; LP64E-WITHFP-NEXT:    sd a0, -288(a1)
 ; LP64E-WITHFP-NEXT:    lw a0, 8(s0)
 ; LP64E-WITHFP-NEXT:    sd a5, 40(s0)
 ; LP64E-WITHFP-NEXT:    sd a2, 16(s0)
diff --git a/llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll b/llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll
index 437b7e557718c..09b2eeb19a69c 100644
--- a/llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll
@@ -37,16 +37,16 @@ define void @lshr_4bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a0, a3, a0
-; RV32I-NEXT:    lbu a3, 0(a1)
-; RV32I-NEXT:    lbu a6, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a3, 1(a1)
+; RV32I-NEXT:    lbu a5, 0(a1)
+; RV32I-NEXT:    lbu a6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a6, a6, 8
-; RV32I-NEXT:    or a3, a6, a3
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a5
+; RV32I-NEXT:    slli a6, a6, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a6
 ; RV32I-NEXT:    or a0, a4, a0
 ; RV32I-NEXT:    or a1, a1, a3
 ; RV32I-NEXT:    slli a1, a1, 3
@@ -101,16 +101,16 @@ define void @shl_4bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a0, a3, a0
-; RV32I-NEXT:    lbu a3, 0(a1)
-; RV32I-NEXT:    lbu a6, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a3, 1(a1)
+; RV32I-NEXT:    lbu a5, 0(a1)
+; RV32I-NEXT:    lbu a6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a6, a6, 8
-; RV32I-NEXT:    or a3, a6, a3
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a5
+; RV32I-NEXT:    slli a6, a6, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a6
 ; RV32I-NEXT:    or a0, a4, a0
 ; RV32I-NEXT:    or a1, a1, a3
 ; RV32I-NEXT:    slli a1, a1, 3
@@ -165,16 +165,16 @@ define void @ashr_4bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a0, a3, a0
-; RV32I-NEXT:    lbu a3, 0(a1)
-; RV32I-NEXT:    lbu a6, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a3, 1(a1)
+; RV32I-NEXT:    lbu a5, 0(a1)
+; RV32I-NEXT:    lbu a6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a6, a6, 8
-; RV32I-NEXT:    or a3, a6, a3
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a5
+; RV32I-NEXT:    slli a6, a6, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a6
 ; RV32I-NEXT:    or a0, a4, a0
 ; RV32I-NEXT:    or a1, a1, a3
 ; RV32I-NEXT:    slli a1, a1, 3
@@ -224,20 +224,20 @@ define void @lshr_8bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t2, t2, 24
 ; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t1, 1(a1)
-; RV64I-NEXT:    or t0, t2, t0
+; RV64I-NEXT:    or a7, t2, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t1, 0(a1)
 ; RV64I-NEXT:    lbu t2, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or a7, t1, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
 ; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t2
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a0, a0, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    slli a1, a1, 3
 ; RV64I-NEXT:    slli a4, a4, 35
@@ -271,16 +271,16 @@ define void @lshr_8bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a3, a3, a6
-; RV32I-NEXT:    lbu a6, 0(a1)
-; RV32I-NEXT:    lbu a7, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a5, 1(a1)
+; RV32I-NEXT:    lbu a6, 0(a1)
+; RV32I-NEXT:    lbu a7, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a7, a7, 8
-; RV32I-NEXT:    or a6, a7, a6
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a5, a5, 8
+; RV32I-NEXT:    or a6, a5, a6
+; RV32I-NEXT:    slli a7, a7, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a7
 ; RV32I-NEXT:    or a5, a4, a3
 ; RV32I-NEXT:    or a4, a1, a6
 ; RV32I-NEXT:    slli a4, a4, 3
@@ -360,20 +360,20 @@ define void @shl_8bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t2, t2, 24
 ; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t1, 1(a1)
-; RV64I-NEXT:    or t0, t2, t0
+; RV64I-NEXT:    or a7, t2, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t1, 0(a1)
 ; RV64I-NEXT:    lbu t2, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or a7, t1, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
 ; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t2
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a0, a0, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    slli a1, a1, 3
 ; RV64I-NEXT:    slli a4, a4, 35
@@ -407,16 +407,16 @@ define void @shl_8bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a3, a3, a6
-; RV32I-NEXT:    lbu a6, 0(a1)
-; RV32I-NEXT:    lbu a7, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a5, 1(a1)
+; RV32I-NEXT:    lbu a6, 0(a1)
+; RV32I-NEXT:    lbu a7, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a7, a7, 8
-; RV32I-NEXT:    or a6, a7, a6
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a5, a5, 8
+; RV32I-NEXT:    or a6, a5, a6
+; RV32I-NEXT:    slli a7, a7, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a7
 ; RV32I-NEXT:    or a5, a4, a3
 ; RV32I-NEXT:    or a4, a1, a6
 ; RV32I-NEXT:    slli a4, a4, 3
@@ -496,20 +496,20 @@ define void @ashr_8bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t2, t2, 24
 ; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t1, 1(a1)
-; RV64I-NEXT:    or t0, t2, t0
+; RV64I-NEXT:    or a7, t2, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t1, 0(a1)
 ; RV64I-NEXT:    lbu t2, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or a7, t1, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
 ; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t2
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a0, a0, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    slli a1, a1, 3
 ; RV64I-NEXT:    slli a4, a4, 35
@@ -540,16 +540,16 @@ define void @ashr_8bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    lbu a5, 6(a0)
 ; RV32I-NEXT:    lbu a6, 7(a0)
 ; RV32I-NEXT:    slli a3, a3, 8
-; RV32I-NEXT:    lbu a7, 0(a1)
-; RV32I-NEXT:    lbu t0, 1(a1)
 ; RV32I-NEXT:    or a3, a3, a4
-; RV32I-NEXT:    lbu a4, 2(a1)
+; RV32I-NEXT:    lbu a4, 1(a1)
+; RV32I-NEXT:    lbu a7, 0(a1)
+; RV32I-NEXT:    lbu t0, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t0, t0, 8
-; RV32I-NEXT:    or a7, t0, a7
-; RV32I-NEXT:    slli a4, a4, 16
+; RV32I-NEXT:    slli a4, a4, 8
+; RV32I-NEXT:    or a7, a4, a7
+; RV32I-NEXT:    slli t0, t0, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a4
+; RV32I-NEXT:    or a1, a1, t0
 ; RV32I-NEXT:    slli a4, a5, 16
 ; RV32I-NEXT:    slli a5, a6, 24
 ; RV32I-NEXT:    or a4, a5, a4
@@ -633,20 +633,20 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t2, 1(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t2, 0(a1)
 ; RV64I-NEXT:    lbu t3, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a4, t1, a5
-; RV64I-NEXT:    or a5, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a5, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a1, a1, 3
 ; RV64I-NEXT:    slli a6, a5, 35
@@ -667,20 +667,20 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli a7, a7, 16
 ; RV64I-NEXT:    slli t0, t0, 24
 ; RV64I-NEXT:    or a6, a6, t1
-; RV64I-NEXT:    lbu t1, 4(a0)
-; RV64I-NEXT:    lbu t2, 5(a0)
 ; RV64I-NEXT:    or a7, t0, a7
-; RV64I-NEXT:    lbu t0, 6(a0)
+; RV64I-NEXT:    lbu t0, 5(a0)
+; RV64I-NEXT:    lbu t1, 4(a0)
+; RV64I-NEXT:    lbu t2, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or t1, t2, t1
-; RV64I-NEXT:    slli t0, t0, 16
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
+; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, t2
 ; RV64I-NEXT:    or a6, a7, a6
 ; RV64I-NEXT:    not a7, a4
 ; RV64I-NEXT:    slli a5, a5, 1
-; RV64I-NEXT:    or a0, a0, t1
+; RV64I-NEXT:    or a0, a0, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a6
 ; RV64I-NEXT:    srl a0, a0, a4
@@ -787,27 +787,27 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    andi a1, a1, 12
 ; RV32I-NEXT:    add a1, t2, a1
 ; RV32I-NEXT:    andi a3, a0, 24
-; RV32I-NEXT:    lw a4, 0(a1)
-; RV32I-NEXT:    lw a5, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
 ; RV32I-NEXT:    xori a3, a3, 31
+; RV32I-NEXT:    lw a4, 4(a1)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a6, 0(a1)
 ; RV32I-NEXT:    lw a1, 12(a1)
-; RV32I-NEXT:    srl a7, a5, a0
-; RV32I-NEXT:    slli t0, a6, 1
-; RV32I-NEXT:    srl a4, a4, a0
-; RV32I-NEXT:    slli a5, a5, 1
+; RV32I-NEXT:    srl a7, a4, a0
+; RV32I-NEXT:    slli t0, a5, 1
 ; RV32I-NEXT:    srl a6, a6, a0
+; RV32I-NEXT:    slli a4, a4, 1
+; RV32I-NEXT:    srl a5, a5, a0
 ; RV32I-NEXT:    slli t1, a1, 1
 ; RV32I-NEXT:    srl a0, a1, a0
 ; RV32I-NEXT:    sll a1, t0, a3
-; RV32I-NEXT:    sll a5, a5, a3
+; RV32I-NEXT:    sll a4, a4, a3
 ; RV32I-NEXT:    sll a3, t1, a3
 ; RV32I-NEXT:    srli t0, a0, 16
 ; RV32I-NEXT:    srli t1, a0, 24
 ; RV32I-NEXT:    srli t2, a0, 8
 ; RV32I-NEXT:    or a1, a7, a1
-; RV32I-NEXT:    or a5, a4, a5
-; RV32I-NEXT:    or a3, a6, a3
+; RV32I-NEXT:    or a4, a6, a4
+; RV32I-NEXT:    or a3, a5, a3
 ; RV32I-NEXT:    sb a0, 12(a2)
 ; RV32I-NEXT:    sb t2, 13(a2)
 ; RV32I-NEXT:    sb t0, 14(a2)
@@ -815,18 +815,18 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    srli a0, a3, 16
 ; RV32I-NEXT:    srli t0, a3, 24
 ; RV32I-NEXT:    srli a3, a3, 8
-; RV32I-NEXT:    srli t1, a5, 16
-; RV32I-NEXT:    srli t2, a5, 24
-; RV32I-NEXT:    srli a5, a5, 8
+; RV32I-NEXT:    srli t1, a4, 16
+; RV32I-NEXT:    srli t2, a4, 24
+; RV32I-NEXT:    srli a4, a4, 8
 ; RV32I-NEXT:    srli t3, a1, 16
 ; RV32I-NEXT:    srli t4, a1, 24
 ; RV32I-NEXT:    srli a1, a1, 8
-; RV32I-NEXT:    sb a6, 8(a2)
+; RV32I-NEXT:    sb a5, 8(a2)
 ; RV32I-NEXT:    sb a3, 9(a2)
 ; RV32I-NEXT:    sb a0, 10(a2)
 ; RV32I-NEXT:    sb t0, 11(a2)
-; RV32I-NEXT:    sb a4, 0(a2)
-; RV32I-NEXT:    sb a5, 1(a2)
+; RV32I-NEXT:    sb a6, 0(a2)
+; RV32I-NEXT:    sb a4, 1(a2)
 ; RV32I-NEXT:    sb t1, 2(a2)
 ; RV32I-NEXT:    sb t2, 3(a2)
 ; RV32I-NEXT:    sb a7, 4(a2)
@@ -872,20 +872,20 @@ define void @lshr_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t2, 1(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t2, 0(a1)
 ; RV64I-NEXT:    lbu t3, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a4, t1, a5
-; RV64I-NEXT:    or a5, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a5, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a1, a1, 5
 ; RV64I-NEXT:    slli a6, a5, 37
@@ -906,20 +906,20 @@ define void @lshr_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    slli a7, a7, 16
 ; RV64I-NEXT:    slli t0, t0, 24
 ; RV64I-NEXT:    or a6, a6, t1
-; RV64I-NEXT:    lbu t1, 4(a0)
-; RV64I-NEXT:    lbu t2, 5(a0)
 ; RV64I-NEXT:    or a7, t0, a7
-; RV64I-NEXT:    lbu t0, 6(a0)
+; RV64I-NEXT:    lbu t0, 5(a0)
+; RV64I-NEXT:    lbu t1, 4(a0)
+; RV64I-NEXT:    lbu t2, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or t1, t2, t1
-; RV64I-NEXT:    slli t0, t0, 16
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
+; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, t2
 ; RV64I-NEXT:    or a6, a7, a6
 ; RV64I-NEXT:    not a7, a4
 ; RV64I-NEXT:    slli a5, a5, 1
-; RV64I-NEXT:    or a0, a0, t1
+; RV64I-NEXT:    or a0, a0, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a6
 ; RV64I-NEXT:    srl a0, a0, a4
@@ -1087,20 +1087,20 @@ define void @shl_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t2, 1(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t2, 0(a1)
 ; RV64I-NEXT:    lbu t3, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a4, t1, a5
-; RV64I-NEXT:    or a5, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a5, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a1, a1, 3
 ; RV64I-NEXT:    slli a6, a5, 35
@@ -1121,20 +1121,20 @@ define void @shl_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli a7, a7, 16
 ; RV64I-NEXT:    slli t0, t0, 24
 ; RV64I-NEXT:    or a6, a6, t1
-; RV64I-NEXT:    lbu t1, 12(a0)
-; RV64I-NEXT:    lbu t2, 13(a0)
 ; RV64I-NEXT:    or a7, t0, a7
-; RV64I-NEXT:    lbu t0, 14(a0)
+; RV64I-NEXT:    lbu t0, 13(a0)
+; RV64I-NEXT:    lbu t1, 12(a0)
+; RV64I-NEXT:    lbu t2, 14(a0)
 ; RV64I-NEXT:    lbu a0, 15(a0)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or t1, t2, t1
-; RV64I-NEXT:    slli t0, t0, 16
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
+; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, t2
 ; RV64I-NEXT:    or a6, a7, a6
 ; RV64I-NEXT:    not a7, a4
 ; RV64I-NEXT:    srli a5, a5, 1
-; RV64I-NEXT:    or a0, a0, t1
+; RV64I-NEXT:    or a0, a0, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a6
 ; RV64I-NEXT:    sll a0, a0, a4
@@ -1326,20 +1326,20 @@ define void @shl_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t2, 1(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t2, 0(a1)
 ; RV64I-NEXT:    lbu t3, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a4, t1, a5
-; RV64I-NEXT:    or a5, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a5, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a1, a1, 5
 ; RV64I-NEXT:    slli a6, a5, 37
@@ -1360,20 +1360,20 @@ define void @shl_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV64I-NEXT:    slli a7, a7, 16
 ; RV64I-NEXT:    slli t0, t0, 24
 ; RV64I-NEXT:    or a6, a6, t1
-; RV64I-NEXT:    lbu t1, 12(a0)
-; RV64I-NEXT:    lbu t2, 13(a0)
 ; RV64I-NEXT:    or a7, t0, a7
-; RV64I-NEXT:    lbu t0, 14(a0)
+; RV64I-NEXT:    lbu t0, 13(a0)
+; RV64I-NEXT:    lbu t1, 12(a0)
+; RV64I-NEXT:    lbu t2, 14(a0)
 ; RV64I-NEXT:    lbu a0, 15(a0)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or t1, t2, t1
-; RV64I-NEXT:    slli t0, t0, 16
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
+; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, t2
 ; RV64I-NEXT:    or a6, a7, a6
 ; RV64I-NEXT:    not a7, a4
 ; RV64I-NEXT:    srli a5, a5, 1
-; RV64I-NEXT:    or a0, a0, t1
+; RV64I-NEXT:    or a0, a0, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a6
 ; RV64I-NEXT:    sll a0, a0, a4
@@ -1542,20 +1542,20 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t2, 1(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t2, 0(a1)
 ; RV64I-NEXT:    lbu t3, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a5, t1, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a6, a5, 32
 ; RV64I-NEXT:    slli a1, a1, 3
 ; RV64I-NEXT:    slli a7, a4, 35
@@ -1578,20 +1578,20 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli a6, a6, 16
 ; RV64I-NEXT:    slli a7, a7, 24
 ; RV64I-NEXT:    or a5, a5, t0
-; RV64I-NEXT:    lbu t0, 4(a0)
-; RV64I-NEXT:    lbu t1, 5(a0)
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 6(a0)
+; RV64I-NEXT:    lbu a7, 5(a0)
+; RV64I-NEXT:    lbu t0, 4(a0)
+; RV64I-NEXT:    lbu t1, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or t0, t1, t0
-; RV64I-NEXT:    slli a7, a7, 16
+; RV64I-NEXT:    slli a7, a7, 8
+; RV64I-NEXT:    or a7, a7, t0
+; RV64I-NEXT:    slli t1, t1, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, a7
+; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a5, a6, a5
 ; RV64I-NEXT:    not a6, a3
 ; RV64I-NEXT:    slli a4, a4, 1
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, a7
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a5
 ; RV64I-NEXT:    srl a0, a0, a3
@@ -1665,17 +1665,17 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli t1, t1, 8
 ; RV32I-NEXT:    or a4, t3, a4
 ; RV32I-NEXT:    or t3, t5, t4
-; RV32I-NEXT:    lbu t4, 0(a1)
-; RV32I-NEXT:    lbu t5, 1(a1)
 ; RV32I-NEXT:    or t0, t1, t0
-; RV32I-NEXT:    lbu t1, 2(a1)
+; RV32I-NEXT:    lbu t1, 1(a1)
+; RV32I-NEXT:    lbu t4, 0(a1)
+; RV32I-NEXT:    lbu t5, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t5, t5, 8
-; RV32I-NEXT:    or t4, t5, t4
-; RV32I-NEXT:    slli t1, t1, 16
+; RV32I-NEXT:    slli t1, t1, 8
+; RV32I-NEXT:    or t1, t1, t4
+; RV32I-NEXT:    slli t5, t5, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, t1
-; RV32I-NEXT:    mv t1, sp
+; RV32I-NEXT:    or a1, a1, t5
+; RV32I-NEXT:    mv t4, sp
 ; RV32I-NEXT:    slli t2, t2, 16
 ; RV32I-NEXT:    slli a0, a0, 24
 ; RV32I-NEXT:    or t2, a0, t2
@@ -1684,7 +1684,7 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    or a5, a7, a6
 ; RV32I-NEXT:    or a4, t3, a4
 ; RV32I-NEXT:    or a6, t2, t0
-; RV32I-NEXT:    or a1, a1, t4
+; RV32I-NEXT:    or a1, a1, t1
 ; RV32I-NEXT:    sw a0, 16(sp)
 ; RV32I-NEXT:    sw a0, 20(sp)
 ; RV32I-NEXT:    sw a0, 24(sp)
@@ -1695,29 +1695,29 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    sw a6, 12(sp)
 ; RV32I-NEXT:    slli a0, a1, 3
 ; RV32I-NEXT:    andi a1, a1, 12
-; RV32I-NEXT:    add a1, t1, a1
+; RV32I-NEXT:    add a1, t4, a1
 ; RV32I-NEXT:    andi a3, a0, 24
-; RV32I-NEXT:    lw a4, 0(a1)
-; RV32I-NEXT:    lw a5, 4(a1)
-; RV32I-NEXT:    lw a6, 8(a1)
 ; RV32I-NEXT:    xori a3, a3, 31
+; RV32I-NEXT:    lw a4, 4(a1)
+; RV32I-NEXT:    lw a5, 8(a1)
+; RV32I-NEXT:    lw a6, 0(a1)
 ; RV32I-NEXT:    lw a1, 12(a1)
-; RV32I-NEXT:    srl a7, a5, a0
-; RV32I-NEXT:    slli t0, a6, 1
-; RV32I-NEXT:    srl a4, a4, a0
-; RV32I-NEXT:    slli a5, a5, 1
+; RV32I-NEXT:    srl a7, a4, a0
+; RV32I-NEXT:    slli t0, a5, 1
 ; RV32I-NEXT:    srl a6, a6, a0
+; RV32I-NEXT:    slli a4, a4, 1
+; RV32I-NEXT:    srl a5, a5, a0
 ; RV32I-NEXT:    slli t1, a1, 1
 ; RV32I-NEXT:    sra a0, a1, a0
 ; RV32I-NEXT:    sll a1, t0, a3
-; RV32I-NEXT:    sll a5, a5, a3
+; RV32I-NEXT:    sll a4, a4, a3
 ; RV32I-NEXT:    sll a3, t1, a3
 ; RV32I-NEXT:    srli t0, a0, 16
 ; RV32I-NEXT:    srli t1, a0, 24
 ; RV32I-NEXT:    srli t2, a0, 8
 ; RV32I-NEXT:    or a1, a7, a1
-; RV32I-NEXT:    or a5, a4, a5
-; RV32I-NEXT:    or a3, a6, a3
+; RV32I-NEXT:    or a4, a6, a4
+; RV32I-NEXT:    or a3, a5, a3
 ; RV32I-NEXT:    sb a0, 12(a2)
 ; RV32I-NEXT:    sb t2, 13(a2)
 ; RV32I-NEXT:    sb t0, 14(a2)
@@ -1725,18 +1725,18 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    srli a0, a3, 16
 ; RV32I-NEXT:    srli t0, a3, 24
 ; RV32I-NEXT:    srli a3, a3, 8
-; RV32I-NEXT:    srli t1, a5, 16
-; RV32I-NEXT:    srli t2, a5, 24
-; RV32I-NEXT:    srli a5, a5, 8
+; RV32I-NEXT:    srli t1, a4, 16
+; RV32I-NEXT:    srli t2, a4, 24
+; RV32I-NEXT:    srli a4, a4, 8
 ; RV32I-NEXT:    srli t3, a1, 16
 ; RV32I-NEXT:    srli t4, a1, 24
 ; RV32I-NEXT:    srli a1, a1, 8
-; RV32I-NEXT:    sb a6, 8(a2)
+; RV32I-NEXT:    sb a5, 8(a2)
 ; RV32I-NEXT:    sb a3, 9(a2)
 ; RV32I-NEXT:    sb a0, 10(a2)
 ; RV32I-NEXT:    sb t0, 11(a2)
-; RV32I-NEXT:    sb a4, 0(a2)
-; RV32I-NEXT:    sb a5, 1(a2)
+; RV32I-NEXT:    sb a6, 0(a2)
+; RV32I-NEXT:    sb a4, 1(a2)
 ; RV32I-NEXT:    sb t1, 2(a2)
 ; RV32I-NEXT:    sb t2, 3(a2)
 ; RV32I-NEXT:    sb a7, 4(a2)
@@ -1782,20 +1782,20 @@ define void @ashr_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 0(a1)
-; RV64I-NEXT:    lbu t2, 1(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 1(a1)
+; RV64I-NEXT:    lbu t2, 0(a1)
 ; RV64I-NEXT:    lbu t3, 2(a1)
 ; RV64I-NEXT:    lbu a1, 3(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a5, t1, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a6, a5, 32
 ; RV64I-NEXT:    slli a1, a1, 5
 ; RV64I-NEXT:    slli a7, a4, 37
@@ -1818,20 +1818,20 @@ define void @ashr_16bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    slli a6, a6, 16
 ; RV64I-NEXT:    slli a7, a7, 24
 ; RV64I-NEXT:    or a5, a5, t0
-; RV64I-NEXT:    lbu t0, 4(a0)
-; RV64I-NEXT:    lbu t1, 5(a0)
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 6(a0)
+; RV64I-NEXT:    lbu a7, 5(a0)
+; RV64I-NEXT:    lbu t0, 4(a0)
+; RV64I-NEXT:    lbu t1, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or t0, t1, t0
-; RV64I-NEXT:    slli a7, a7, 16
+; RV64I-NEXT:    slli a7, a7, 8
+; RV64I-NEXT:    or a7, a7, t0
+; RV64I-NEXT:    slli t1, t1, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, a7
+; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a5, a6, a5
 ; RV64I-NEXT:    not a6, a3
 ; RV64I-NEXT:    slli a4, a4, 1
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, a7
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a5
 ; RV64I-NEXT:    srl a0, a0, a3
@@ -2065,13 +2065,13 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    sd zero, 32(sp)
 ; RV64I-NEXT:    sd zero, 40(sp)
 ; RV64I-NEXT:    sd zero, 48(sp)
@@ -2088,8 +2088,8 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -2108,22 +2108,22 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    andi a1, a1, 24
 ; RV64I-NEXT:    add a1, s6, a1
 ; RV64I-NEXT:    andi a0, a4, 56
-; RV64I-NEXT:    ld a3, 0(a1)
-; RV64I-NEXT:    ld a5, 8(a1)
+; RV64I-NEXT:    xori a5, a0, 63
+; RV64I-NEXT:    ld a3, 8(a1)
 ; RV64I-NEXT:    ld a6, 16(a1)
-; RV64I-NEXT:    xori a7, a0, 63
+; RV64I-NEXT:    ld a7, 0(a1)
 ; RV64I-NEXT:    ld t0, 24(a1)
-; RV64I-NEXT:    srl a0, a5, a4
+; RV64I-NEXT:    srl a0, a3, a4
 ; RV64I-NEXT:    slli t1, a6, 1
-; RV64I-NEXT:    srl a1, a3, a4
-; RV64I-NEXT:    slli a5, a5, 1
+; RV64I-NEXT:    srl a1, a7, a4
+; RV64I-NEXT:    slli a7, a3, 1
 ; RV64I-NEXT:    srl a3, a6, a4
 ; RV64I-NEXT:    slli a6, t0, 1
 ; RV64I-NEXT:    srl t0, t0, a4
-; RV64I-NEXT:    sll a4, t1, a7
-; RV64I-NEXT:    sll a5, a5, a7
-; RV64I-NEXT:    sll a6, a6, a7
-; RV64I-NEXT:    srli a7, t0, 56
+; RV64I-NEXT:    sll a4, t1, a5
+; RV64I-NEXT:    sll a7, a7, a5
+; RV64I-NEXT:    sll a5, a6, a5
+; RV64I-NEXT:    srli a6, t0, 56
 ; RV64I-NEXT:    srli t1, t0, 48
 ; RV64I-NEXT:    srli t2, t0, 40
 ; RV64I-NEXT:    srli t3, t0, 32
@@ -2131,40 +2131,40 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    srli t5, t0, 16
 ; RV64I-NEXT:    srli t6, t0, 8
 ; RV64I-NEXT:    or a4, a0, a4
-; RV64I-NEXT:    or a5, a1, a5
-; RV64I-NEXT:    or a6, a3, a6
+; RV64I-NEXT:    or a7, a1, a7
+; RV64I-NEXT:    or a5, a3, a5
 ; RV64I-NEXT:    sb t3, 28(a2)
 ; RV64I-NEXT:    sb t2, 29(a2)
 ; RV64I-NEXT:    sb t1, 30(a2)
-; RV64I-NEXT:    sb a7, 31(a2)
+; RV64I-NEXT:    sb a6, 31(a2)
 ; RV64I-NEXT:    sb t0, 24(a2)
 ; RV64I-NEXT:    sb t6, 25(a2)
 ; RV64I-NEXT:    sb t5, 26(a2)
 ; RV64I-NEXT:    sb t4, 27(a2)
-; RV64I-NEXT:    srli a7, a6, 56
-; RV64I-NEXT:    srli t0, a6, 48
-; RV64I-NEXT:    srli t1, a6, 40
-; RV64I-NEXT:    srli t2, a6, 32
-; RV64I-NEXT:    srli t3, a6, 24
-; RV64I-NEXT:    srli t4, a6, 16
-; RV64I-NEXT:    srli a6, a6, 8
-; RV64I-NEXT:    srli t5, a5, 56
-; RV64I-NEXT:    srli t6, a5, 48
-; RV64I-NEXT:    srli s0, a5, 40
-; RV64I-NEXT:    srli s1, a5, 32
-; RV64I-NEXT:    srli s2, a5, 24
-; RV64I-NEXT:    srli s3, a5, 16
+; RV64I-NEXT:    srli a6, a5, 56
+; RV64I-NEXT:    srli t0, a5, 48
+; RV64I-NEXT:    srli t1, a5, 40
+; RV64I-NEXT:    srli t2, a5, 32
+; RV64I-NEXT:    srli t3, a5, 24
+; RV64I-NEXT:    srli t4, a5, 16
 ; RV64I-NEXT:    srli a5, a5, 8
+; RV64I-NEXT:    srli t5, a7, 56
+; RV64I-NEXT:    srli t6, a7, 48
+; RV64I-NEXT:    srli s0, a7, 40
+; RV64I-NEXT:    srli s1, a7, 32
+; RV64I-NEXT:    srli s2, a7, 24
+; RV64I-NEXT:    srli s3, a7, 16
+; RV64I-NEXT:    srli a7, a7, 8
 ; RV64I-NEXT:    srli s4, a4, 56
 ; RV64I-NEXT:    srli s5, a4, 48
 ; RV64I-NEXT:    srli s6, a4, 40
 ; RV64I-NEXT:    sb t2, 20(a2)
 ; RV64I-NEXT:    sb t1, 21(a2)
 ; RV64I-NEXT:    sb t0, 22(a2)
-; RV64I-NEXT:    sb a7, 23(a2)
-; RV64I-NEXT:    srli a7, a4, 32
+; RV64I-NEXT:    sb a6, 23(a2)
+; RV64I-NEXT:    srli a6, a4, 32
 ; RV64I-NEXT:    sb a3, 16(a2)
-; RV64I-NEXT:    sb a6, 17(a2)
+; RV64I-NEXT:    sb a5, 17(a2)
 ; RV64I-NEXT:    sb t4, 18(a2)
 ; RV64I-NEXT:    sb t3, 19(a2)
 ; RV64I-NEXT:    srli a3, a4, 24
@@ -2172,19 +2172,19 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    sb s0, 5(a2)
 ; RV64I-NEXT:    sb t6, 6(a2)
 ; RV64I-NEXT:    sb t5, 7(a2)
-; RV64I-NEXT:    srli a6, a4, 16
+; RV64I-NEXT:    srli a5, a4, 16
 ; RV64I-NEXT:    srli a4, a4, 8
 ; RV64I-NEXT:    sb a1, 0(a2)
-; RV64I-NEXT:    sb a5, 1(a2)
+; RV64I-NEXT:    sb a7, 1(a2)
 ; RV64I-NEXT:    sb s3, 2(a2)
 ; RV64I-NEXT:    sb s2, 3(a2)
-; RV64I-NEXT:    sb a7, 12(a2)
+; RV64I-NEXT:    sb a6, 12(a2)
 ; RV64I-NEXT:    sb s6, 13(a2)
 ; RV64I-NEXT:    sb s5, 14(a2)
 ; RV64I-NEXT:    sb s4, 15(a2)
 ; RV64I-NEXT:    sb a0, 8(a2)
 ; RV64I-NEXT:    sb a4, 9(a2)
-; RV64I-NEXT:    sb a6, 10(a2)
+; RV64I-NEXT:    sb a5, 10(a2)
 ; RV64I-NEXT:    sb a3, 11(a2)
 ; RV64I-NEXT:    ld s0, 152(sp) # 8-byte Folded Reload
 ; RV64I-NEXT:    ld s1, 144(sp) # 8-byte Folded Reload
@@ -2543,13 +2543,13 @@ define void @lshr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    sd zero, 32(sp)
 ; RV64I-NEXT:    sd zero, 40(sp)
 ; RV64I-NEXT:    sd zero, 48(sp)
@@ -2566,8 +2566,8 @@ define void @lshr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -2587,24 +2587,24 @@ define void @lshr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    andi a1, a1, 24
 ; RV64I-NEXT:    andi a0, a3, 32
 ; RV64I-NEXT:    add a1, s6, a1
-; RV64I-NEXT:    ld a4, 0(a1)
+; RV64I-NEXT:    xori a4, a0, 63
 ; RV64I-NEXT:    ld a5, 8(a1)
 ; RV64I-NEXT:    ld a6, 16(a1)
-; RV64I-NEXT:    xori a7, a0, 63
+; RV64I-NEXT:    ld a7, 0(a1)
 ; RV64I-NEXT:    ld t0, 24(a1)
 ; RV64I-NEXT:    srl a0, a5, a3
 ; RV64I-NEXT:    slli t1, a6, 1
-; RV64I-NEXT:    srl a1, a4, a3
+; RV64I-NEXT:    srl a1, a7, a3
 ; RV64I-NEXT:    slli a5, a5, 1
-; RV64I-NEXT:    srl a4, a6, a3
-; RV64I-NEXT:    slli a6, t0, 1
+; RV64I-NEXT:    srl a6, a6, a3
+; RV64I-NEXT:    slli a7, t0, 1
 ; RV64I-NEXT:    srl a3, t0, a3
-; RV64I-NEXT:    sll t0, t1, a7
-; RV64I-NEXT:    sll a5, a5, a7
-; RV64I-NEXT:    sll a6, a6, a7
-; RV64I-NEXT:    srli a7, a4, 24
-; RV64I-NEXT:    srli t1, a4, 16
-; RV64I-NEXT:    srli t2, a4, 8
+; RV64I-NEXT:    sll t0, t1, a4
+; RV64I-NEXT:    sll a5, a5, a4
+; RV64I-NEXT:    sll a4, a7, a4
+; RV64I-NEXT:    srli a7, a6, 24
+; RV64I-NEXT:    srli t1, a6, 16
+; RV64I-NEXT:    srli t2, a6, 8
 ; RV64I-NEXT:    srli t3, a3, 56
 ; RV64I-NEXT:    srli t4, a3, 48
 ; RV64I-NEXT:    srli t5, a3, 40
@@ -2616,12 +2616,12 @@ define void @lshr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    srli s4, a1, 16
 ; RV64I-NEXT:    srli s5, a1, 8
 ; RV64I-NEXT:    srli s6, a0, 24
-; RV64I-NEXT:    or a6, a4, a6
-; RV64I-NEXT:    sb a4, 16(a2)
+; RV64I-NEXT:    or a4, a6, a4
+; RV64I-NEXT:    sb a6, 16(a2)
 ; RV64I-NEXT:    sb t2, 17(a2)
 ; RV64I-NEXT:    sb t1, 18(a2)
 ; RV64I-NEXT:    sb a7, 19(a2)
-; RV64I-NEXT:    srli a4, a0, 16
+; RV64I-NEXT:    srli a6, a0, 16
 ; RV64I-NEXT:    sb t6, 28(a2)
 ; RV64I-NEXT:    sb t5, 29(a2)
 ; RV64I-NEXT:    sb t4, 30(a2)
@@ -2639,12 +2639,12 @@ define void @lshr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    sb s3, 3(a2)
 ; RV64I-NEXT:    sb a0, 8(a2)
 ; RV64I-NEXT:    sb a7, 9(a2)
-; RV64I-NEXT:    sb a4, 10(a2)
+; RV64I-NEXT:    sb a6, 10(a2)
 ; RV64I-NEXT:    sb s6, 11(a2)
-; RV64I-NEXT:    srli a0, a6, 56
-; RV64I-NEXT:    srli a1, a6, 48
-; RV64I-NEXT:    srli a3, a6, 40
-; RV64I-NEXT:    srli a4, a6, 32
+; RV64I-NEXT:    srli a0, a4, 56
+; RV64I-NEXT:    srli a1, a4, 48
+; RV64I-NEXT:    srli a3, a4, 40
+; RV64I-NEXT:    srli a4, a4, 32
 ; RV64I-NEXT:    srli a6, a5, 56
 ; RV64I-NEXT:    srli a7, a5, 48
 ; RV64I-NEXT:    srli t1, a5, 40
@@ -2797,13 +2797,13 @@ define void @lshr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV32I-NEXT:    sw t0, 12(sp)
 ; RV32I-NEXT:    sw t1, 16(sp)
 ; RV32I-NEXT:    sw a5, 20(sp)
-; RV32I-NEXT:    lw a6, 16(t6)
-; RV32I-NEXT:    lw a5, 20(t6)
-; RV32I-NEXT:    lw a7, 24(t6)
 ; RV32I-NEXT:    lw a1, 0(t6)
 ; RV32I-NEXT:    lw a0, 4(t6)
 ; RV32I-NEXT:    lw a4, 8(t6)
 ; RV32I-NEXT:    lw a3, 12(t6)
+; RV32I-NEXT:    lw a7, 24(t6)
+; RV32I-NEXT:    lw a5, 20(t6)
+; RV32I-NEXT:    lw a6, 16(t6)
 ; RV32I-NEXT:    lw t0, 28(t6)
 ; RV32I-NEXT:    srli t1, a7, 24
 ; RV32I-NEXT:    srli t2, a7, 16
@@ -3197,13 +3197,13 @@ define void @lshr_32bytes_dwordOff(ptr %src.ptr, ptr %dwordOff.ptr, ptr %dst) no
 ; RV32I-NEXT:    sw t0, 12(sp)
 ; RV32I-NEXT:    sw t1, 16(sp)
 ; RV32I-NEXT:    sw a5, 20(sp)
-; RV32I-NEXT:    lw a6, 16(t6)
-; RV32I-NEXT:    lw a5, 20(t6)
-; RV32I-NEXT:    lw a7, 24(t6)
 ; RV32I-NEXT:    lw a1, 0(t6)
 ; RV32I-NEXT:    lw a0, 4(t6)
 ; RV32I-NEXT:    lw a4, 8(t6)
 ; RV32I-NEXT:    lw a3, 12(t6)
+; RV32I-NEXT:    lw a7, 24(t6)
+; RV32I-NEXT:    lw a5, 20(t6)
+; RV32I-NEXT:    lw a6, 16(t6)
 ; RV32I-NEXT:    lw t0, 28(t6)
 ; RV32I-NEXT:    srli t1, a7, 24
 ; RV32I-NEXT:    srli t2, a7, 16
@@ -3380,13 +3380,13 @@ define void @shl_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    sd zero, 0(sp)
 ; RV64I-NEXT:    sd zero, 8(sp)
 ; RV64I-NEXT:    sd zero, 16(sp)
@@ -3403,8 +3403,8 @@ define void @shl_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -3858,13 +3858,13 @@ define void @shl_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    sd zero, 0(sp)
 ; RV64I-NEXT:    sd zero, 8(sp)
 ; RV64I-NEXT:    sd zero, 16(sp)
@@ -3881,8 +3881,8 @@ define void @shl_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -4112,13 +4112,13 @@ define void @shl_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) nounw
 ; RV32I-NEXT:    sw t0, 44(sp)
 ; RV32I-NEXT:    sw t1, 48(sp)
 ; RV32I-NEXT:    sw a5, 52(sp)
-; RV32I-NEXT:    lw a6, 16(t2)
-; RV32I-NEXT:    lw a5, 20(t2)
-; RV32I-NEXT:    lw a7, 24(t2)
 ; RV32I-NEXT:    lw a1, 0(t2)
 ; RV32I-NEXT:    lw a0, 4(t2)
 ; RV32I-NEXT:    lw a4, 8(t2)
 ; RV32I-NEXT:    lw a3, 12(t2)
+; RV32I-NEXT:    lw a7, 24(t2)
+; RV32I-NEXT:    lw a5, 20(t2)
+; RV32I-NEXT:    lw a6, 16(t2)
 ; RV32I-NEXT:    lw t0, 28(t2)
 ; RV32I-NEXT:    srli t1, a7, 24
 ; RV32I-NEXT:    srli t2, a7, 16
@@ -4512,13 +4512,13 @@ define void @shl_32bytes_dwordOff(ptr %src.ptr, ptr %dwordOff.ptr, ptr %dst) nou
 ; RV32I-NEXT:    sw t0, 44(sp)
 ; RV32I-NEXT:    sw t1, 48(sp)
 ; RV32I-NEXT:    sw a5, 52(sp)
-; RV32I-NEXT:    lw a6, 16(t2)
-; RV32I-NEXT:    lw a5, 20(t2)
-; RV32I-NEXT:    lw a7, 24(t2)
 ; RV32I-NEXT:    lw a1, 0(t2)
 ; RV32I-NEXT:    lw a0, 4(t2)
 ; RV32I-NEXT:    lw a4, 8(t2)
 ; RV32I-NEXT:    lw a3, 12(t2)
+; RV32I-NEXT:    lw a7, 24(t2)
+; RV32I-NEXT:    lw a5, 20(t2)
+; RV32I-NEXT:    lw a6, 16(t2)
 ; RV32I-NEXT:    lw t0, 28(t2)
 ; RV32I-NEXT:    srli t1, a7, 24
 ; RV32I-NEXT:    srli t2, a7, 16
@@ -4695,13 +4695,13 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    slli s7, s7, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, s7
@@ -4714,8 +4714,8 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -4739,22 +4739,22 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    andi a1, a1, 24
 ; RV64I-NEXT:    add a1, s6, a1
 ; RV64I-NEXT:    andi a0, a4, 56
-; RV64I-NEXT:    ld a3, 0(a1)
-; RV64I-NEXT:    ld a5, 8(a1)
+; RV64I-NEXT:    xori a5, a0, 63
+; RV64I-NEXT:    ld a3, 8(a1)
 ; RV64I-NEXT:    ld a6, 16(a1)
-; RV64I-NEXT:    xori a7, a0, 63
+; RV64I-NEXT:    ld a7, 0(a1)
 ; RV64I-NEXT:    ld t0, 24(a1)
-; RV64I-NEXT:    srl a0, a5, a4
+; RV64I-NEXT:    srl a0, a3, a4
 ; RV64I-NEXT:    slli t1, a6, 1
-; RV64I-NEXT:    srl a1, a3, a4
-; RV64I-NEXT:    slli a5, a5, 1
+; RV64I-NEXT:    srl a1, a7, a4
+; RV64I-NEXT:    slli a7, a3, 1
 ; RV64I-NEXT:    srl a3, a6, a4
 ; RV64I-NEXT:    slli a6, t0, 1
 ; RV64I-NEXT:    sra t0, t0, a4
-; RV64I-NEXT:    sll a4, t1, a7
-; RV64I-NEXT:    sll a5, a5, a7
-; RV64I-NEXT:    sll a6, a6, a7
-; RV64I-NEXT:    srli a7, t0, 56
+; RV64I-NEXT:    sll a4, t1, a5
+; RV64I-NEXT:    sll a7, a7, a5
+; RV64I-NEXT:    sll a5, a6, a5
+; RV64I-NEXT:    srli a6, t0, 56
 ; RV64I-NEXT:    srli t1, t0, 48
 ; RV64I-NEXT:    srli t2, t0, 40
 ; RV64I-NEXT:    srli t3, t0, 32
@@ -4762,40 +4762,40 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    srli t5, t0, 16
 ; RV64I-NEXT:    srli t6, t0, 8
 ; RV64I-NEXT:    or a4, a0, a4
-; RV64I-NEXT:    or a5, a1, a5
-; RV64I-NEXT:    or a6, a3, a6
+; RV64I-NEXT:    or a7, a1, a7
+; RV64I-NEXT:    or a5, a3, a5
 ; RV64I-NEXT:    sb t3, 28(a2)
 ; RV64I-NEXT:    sb t2, 29(a2)
 ; RV64I-NEXT:    sb t1, 30(a2)
-; RV64I-NEXT:    sb a7, 31(a2)
+; RV64I-NEXT:    sb a6, 31(a2)
 ; RV64I-NEXT:    sb t0, 24(a2)
 ; RV64I-NEXT:    sb t6, 25(a2)
 ; RV64I-NEXT:    sb t5, 26(a2)
 ; RV64I-NEXT:    sb t4, 27(a2)
-; RV64I-NEXT:    srli a7, a6, 56
-; RV64I-NEXT:    srli t0, a6, 48
-; RV64I-NEXT:    srli t1, a6, 40
-; RV64I-NEXT:    srli t2, a6, 32
-; RV64I-NEXT:    srli t3, a6, 24
-; RV64I-NEXT:    srli t4, a6, 16
-; RV64I-NEXT:    srli a6, a6, 8
-; RV64I-NEXT:    srli t5, a5, 56
-; RV64I-NEXT:    srli t6, a5, 48
-; RV64I-NEXT:    srli s0, a5, 40
-; RV64I-NEXT:    srli s1, a5, 32
-; RV64I-NEXT:    srli s2, a5, 24
-; RV64I-NEXT:    srli s3, a5, 16
+; RV64I-NEXT:    srli a6, a5, 56
+; RV64I-NEXT:    srli t0, a5, 48
+; RV64I-NEXT:    srli t1, a5, 40
+; RV64I-NEXT:    srli t2, a5, 32
+; RV64I-NEXT:    srli t3, a5, 24
+; RV64I-NEXT:    srli t4, a5, 16
 ; RV64I-NEXT:    srli a5, a5, 8
+; RV64I-NEXT:    srli t5, a7, 56
+; RV64I-NEXT:    srli t6, a7, 48
+; RV64I-NEXT:    srli s0, a7, 40
+; RV64I-NEXT:    srli s1, a7, 32
+; RV64I-NEXT:    srli s2, a7, 24
+; RV64I-NEXT:    srli s3, a7, 16
+; RV64I-NEXT:    srli a7, a7, 8
 ; RV64I-NEXT:    srli s4, a4, 56
 ; RV64I-NEXT:    srli s5, a4, 48
 ; RV64I-NEXT:    srli s6, a4, 40
 ; RV64I-NEXT:    sb t2, 20(a2)
 ; RV64I-NEXT:    sb t1, 21(a2)
 ; RV64I-NEXT:    sb t0, 22(a2)
-; RV64I-NEXT:    sb a7, 23(a2)
-; RV64I-NEXT:    srli a7, a4, 32
+; RV64I-NEXT:    sb a6, 23(a2)
+; RV64I-NEXT:    srli a6, a4, 32
 ; RV64I-NEXT:    sb a3, 16(a2)
-; RV64I-NEXT:    sb a6, 17(a2)
+; RV64I-NEXT:    sb a5, 17(a2)
 ; RV64I-NEXT:    sb t4, 18(a2)
 ; RV64I-NEXT:    sb t3, 19(a2)
 ; RV64I-NEXT:    srli a3, a4, 24
@@ -4803,19 +4803,19 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    sb s0, 5(a2)
 ; RV64I-NEXT:    sb t6, 6(a2)
 ; RV64I-NEXT:    sb t5, 7(a2)
-; RV64I-NEXT:    srli a6, a4, 16
+; RV64I-NEXT:    srli a5, a4, 16
 ; RV64I-NEXT:    srli a4, a4, 8
 ; RV64I-NEXT:    sb a1, 0(a2)
-; RV64I-NEXT:    sb a5, 1(a2)
+; RV64I-NEXT:    sb a7, 1(a2)
 ; RV64I-NEXT:    sb s3, 2(a2)
 ; RV64I-NEXT:    sb s2, 3(a2)
-; RV64I-NEXT:    sb a7, 12(a2)
+; RV64I-NEXT:    sb a6, 12(a2)
 ; RV64I-NEXT:    sb s6, 13(a2)
 ; RV64I-NEXT:    sb s5, 14(a2)
 ; RV64I-NEXT:    sb s4, 15(a2)
 ; RV64I-NEXT:    sb a0, 8(a2)
 ; RV64I-NEXT:    sb a4, 9(a2)
-; RV64I-NEXT:    sb a6, 10(a2)
+; RV64I-NEXT:    sb a5, 10(a2)
 ; RV64I-NEXT:    sb a3, 11(a2)
 ; RV64I-NEXT:    ld s0, 152(sp) # 8-byte Folded Reload
 ; RV64I-NEXT:    ld s1, 144(sp) # 8-byte Folded Reload
@@ -5175,13 +5175,13 @@ define void @ashr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    slli s7, s7, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, s7
@@ -5194,8 +5194,8 @@ define void @ashr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -5220,24 +5220,24 @@ define void @ashr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    andi a1, a1, 24
 ; RV64I-NEXT:    andi a0, a3, 32
 ; RV64I-NEXT:    add a1, s6, a1
-; RV64I-NEXT:    ld a4, 0(a1)
+; RV64I-NEXT:    xori a4, a0, 63
 ; RV64I-NEXT:    ld a5, 8(a1)
 ; RV64I-NEXT:    ld a6, 16(a1)
-; RV64I-NEXT:    xori a7, a0, 63
+; RV64I-NEXT:    ld a7, 0(a1)
 ; RV64I-NEXT:    ld t0, 24(a1)
 ; RV64I-NEXT:    srl a0, a5, a3
 ; RV64I-NEXT:    slli t1, a6, 1
-; RV64I-NEXT:    srl a1, a4, a3
+; RV64I-NEXT:    srl a1, a7, a3
 ; RV64I-NEXT:    slli a5, a5, 1
-; RV64I-NEXT:    srl a4, a6, a3
-; RV64I-NEXT:    slli a6, t0, 1
+; RV64I-NEXT:    srl a6, a6, a3
+; RV64I-NEXT:    slli a7, t0, 1
 ; RV64I-NEXT:    sra a3, t0, a3
-; RV64I-NEXT:    sll t0, t1, a7
-; RV64I-NEXT:    sll a5, a5, a7
-; RV64I-NEXT:    sll a6, a6, a7
-; RV64I-NEXT:    srli a7, a4, 24
-; RV64I-NEXT:    srli t1, a4, 16
-; RV64I-NEXT:    srli t2, a4, 8
+; RV64I-NEXT:    sll t0, t1, a4
+; RV64I-NEXT:    sll a5, a5, a4
+; RV64I-NEXT:    sll a4, a7, a4
+; RV64I-NEXT:    srli a7, a6, 24
+; RV64I-NEXT:    srli t1, a6, 16
+; RV64I-NEXT:    srli t2, a6, 8
 ; RV64I-NEXT:    srli t3, a3, 56
 ; RV64I-NEXT:    srli t4, a3, 48
 ; RV64I-NEXT:    srli t5, a3, 40
@@ -5249,12 +5249,12 @@ define void @ashr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    srli s4, a1, 16
 ; RV64I-NEXT:    srli s5, a1, 8
 ; RV64I-NEXT:    srli s6, a0, 24
-; RV64I-NEXT:    or a6, a4, a6
-; RV64I-NEXT:    sb a4, 16(a2)
+; RV64I-NEXT:    or a4, a6, a4
+; RV64I-NEXT:    sb a6, 16(a2)
 ; RV64I-NEXT:    sb t2, 17(a2)
 ; RV64I-NEXT:    sb t1, 18(a2)
 ; RV64I-NEXT:    sb a7, 19(a2)
-; RV64I-NEXT:    srli a4, a0, 16
+; RV64I-NEXT:    srli a6, a0, 16
 ; RV64I-NEXT:    sb t6, 28(a2)
 ; RV64I-NEXT:    sb t5, 29(a2)
 ; RV64I-NEXT:    sb t4, 30(a2)
@@ -5272,12 +5272,12 @@ define void @ashr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV64I-NEXT:    sb s3, 3(a2)
 ; RV64I-NEXT:    sb a0, 8(a2)
 ; RV64I-NEXT:    sb a7, 9(a2)
-; RV64I-NEXT:    sb a4, 10(a2)
+; RV64I-NEXT:    sb a6, 10(a2)
 ; RV64I-NEXT:    sb s6, 11(a2)
-; RV64I-NEXT:    srli a0, a6, 56
-; RV64I-NEXT:    srli a1, a6, 48
-; RV64I-NEXT:    srli a3, a6, 40
-; RV64I-NEXT:    srli a4, a6, 32
+; RV64I-NEXT:    srli a0, a4, 56
+; RV64I-NEXT:    srli a1, a4, 48
+; RV64I-NEXT:    srli a3, a4, 40
+; RV64I-NEXT:    srli a4, a4, 32
 ; RV64I-NEXT:    srli a6, a5, 56
 ; RV64I-NEXT:    srli a7, a5, 48
 ; RV64I-NEXT:    srli t1, a5, 40
@@ -5431,13 +5431,13 @@ define void @ashr_32bytes_wordOff(ptr %src.ptr, ptr %wordOff.ptr, ptr %dst) noun
 ; RV32I-NEXT:    sw t0, 12(sp)
 ; RV32I-NEXT:    sw t1, 16(sp)
 ; RV32I-NEXT:    sw a5, 20(sp)
-; RV32I-NEXT:    lw a6, 16(s6)
-; RV32I-NEXT:    lw a5, 20(s6)
-; RV32I-NEXT:    lw a7, 24(s6)
 ; RV32I-NEXT:    lw a1, 0(s6)
 ; RV32I-NEXT:    lw a0, 4(s6)
 ; RV32I-NEXT:    lw a4, 8(s6)
 ; RV32I-NEXT:    lw a3, 12(s6)
+; RV32I-NEXT:    lw a7, 24(s6)
+; RV32I-NEXT:    lw a5, 20(s6)
+; RV32I-NEXT:    lw a6, 16(s6)
 ; RV32I-NEXT:    lw t0, 28(s6)
 ; RV32I-NEXT:    srli t1, a7, 24
 ; RV32I-NEXT:    srli t2, a7, 16
@@ -5833,13 +5833,13 @@ define void @ashr_32bytes_dwordOff(ptr %src.ptr, ptr %dwordOff.ptr, ptr %dst) no
 ; RV32I-NEXT:    sw t0, 12(sp)
 ; RV32I-NEXT:    sw t1, 16(sp)
 ; RV32I-NEXT:    sw a5, 20(sp)
-; RV32I-NEXT:    lw a6, 16(s6)
-; RV32I-NEXT:    lw a5, 20(s6)
-; RV32I-NEXT:    lw a7, 24(s6)
 ; RV32I-NEXT:    lw a1, 0(s6)
 ; RV32I-NEXT:    lw a0, 4(s6)
 ; RV32I-NEXT:    lw a4, 8(s6)
 ; RV32I-NEXT:    lw a3, 12(s6)
+; RV32I-NEXT:    lw a7, 24(s6)
+; RV32I-NEXT:    lw a5, 20(s6)
+; RV32I-NEXT:    lw a6, 16(s6)
 ; RV32I-NEXT:    lw t0, 28(s6)
 ; RV32I-NEXT:    srli t1, a7, 24
 ; RV32I-NEXT:    srli t2, a7, 16
diff --git a/llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll b/llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll
index b2c130c2d7c10..cd7f30d8f5898 100644
--- a/llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll
@@ -36,16 +36,16 @@ define void @lshr_4bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a0, a3, a0
-; RV32I-NEXT:    lbu a3, 0(a1)
-; RV32I-NEXT:    lbu a6, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a3, 1(a1)
+; RV32I-NEXT:    lbu a5, 0(a1)
+; RV32I-NEXT:    lbu a6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a6, a6, 8
-; RV32I-NEXT:    or a3, a6, a3
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a5
+; RV32I-NEXT:    slli a6, a6, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a6
 ; RV32I-NEXT:    or a0, a4, a0
 ; RV32I-NEXT:    or a1, a1, a3
 ; RV32I-NEXT:    srl a0, a0, a1
@@ -97,16 +97,16 @@ define void @shl_4bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a0, a3, a0
-; RV32I-NEXT:    lbu a3, 0(a1)
-; RV32I-NEXT:    lbu a6, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a3, 1(a1)
+; RV32I-NEXT:    lbu a5, 0(a1)
+; RV32I-NEXT:    lbu a6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a6, a6, 8
-; RV32I-NEXT:    or a3, a6, a3
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a5
+; RV32I-NEXT:    slli a6, a6, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a6
 ; RV32I-NEXT:    or a0, a4, a0
 ; RV32I-NEXT:    or a1, a1, a3
 ; RV32I-NEXT:    sll a0, a0, a1
@@ -158,16 +158,16 @@ define void @ashr_4bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a0, a3, a0
-; RV32I-NEXT:    lbu a3, 0(a1)
-; RV32I-NEXT:    lbu a6, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a3, 1(a1)
+; RV32I-NEXT:    lbu a5, 0(a1)
+; RV32I-NEXT:    lbu a6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a6, a6, 8
-; RV32I-NEXT:    or a3, a6, a3
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a3, a3, 8
+; RV32I-NEXT:    or a3, a3, a5
+; RV32I-NEXT:    slli a6, a6, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a6
 ; RV32I-NEXT:    or a0, a4, a0
 ; RV32I-NEXT:    or a1, a1, a3
 ; RV32I-NEXT:    sra a0, a0, a1
@@ -215,20 +215,20 @@ define void @lshr_8bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t2, t2, 24
 ; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 4(a1)
-; RV64I-NEXT:    lbu t1, 5(a1)
-; RV64I-NEXT:    or t0, t2, t0
+; RV64I-NEXT:    or a7, t2, t0
+; RV64I-NEXT:    lbu t0, 5(a1)
+; RV64I-NEXT:    lbu t1, 4(a1)
 ; RV64I-NEXT:    lbu t2, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or a7, t1, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
 ; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t2
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a0, a0, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    slli a1, a1, 32
 ; RV64I-NEXT:    or a0, a0, a3
@@ -261,16 +261,16 @@ define void @lshr_8bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a3, a3, a6
-; RV32I-NEXT:    lbu a6, 0(a1)
-; RV32I-NEXT:    lbu a7, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a5, 1(a1)
+; RV32I-NEXT:    lbu a6, 0(a1)
+; RV32I-NEXT:    lbu a7, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a7, a7, 8
-; RV32I-NEXT:    or a6, a7, a6
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a5, a5, 8
+; RV32I-NEXT:    or a6, a5, a6
+; RV32I-NEXT:    slli a7, a7, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a7
 ; RV32I-NEXT:    or a5, a4, a3
 ; RV32I-NEXT:    or a4, a1, a6
 ; RV32I-NEXT:    addi a3, a4, -32
@@ -348,20 +348,20 @@ define void @shl_8bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t2, t2, 24
 ; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 4(a1)
-; RV64I-NEXT:    lbu t1, 5(a1)
-; RV64I-NEXT:    or t0, t2, t0
+; RV64I-NEXT:    or a7, t2, t0
+; RV64I-NEXT:    lbu t0, 5(a1)
+; RV64I-NEXT:    lbu t1, 4(a1)
 ; RV64I-NEXT:    lbu t2, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or a7, t1, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
 ; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t2
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a0, a0, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    slli a1, a1, 32
 ; RV64I-NEXT:    or a0, a0, a3
@@ -394,16 +394,16 @@ define void @shl_8bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a4, a4, 16
 ; RV32I-NEXT:    slli a5, a5, 24
 ; RV32I-NEXT:    or a3, a3, a6
-; RV32I-NEXT:    lbu a6, 0(a1)
-; RV32I-NEXT:    lbu a7, 1(a1)
 ; RV32I-NEXT:    or a4, a5, a4
-; RV32I-NEXT:    lbu a5, 2(a1)
+; RV32I-NEXT:    lbu a5, 1(a1)
+; RV32I-NEXT:    lbu a6, 0(a1)
+; RV32I-NEXT:    lbu a7, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli a7, a7, 8
-; RV32I-NEXT:    or a6, a7, a6
-; RV32I-NEXT:    slli a5, a5, 16
+; RV32I-NEXT:    slli a5, a5, 8
+; RV32I-NEXT:    or a6, a5, a6
+; RV32I-NEXT:    slli a7, a7, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a5
+; RV32I-NEXT:    or a1, a1, a7
 ; RV32I-NEXT:    or a5, a4, a3
 ; RV32I-NEXT:    or a4, a1, a6
 ; RV32I-NEXT:    addi a3, a4, -32
@@ -481,20 +481,20 @@ define void @ashr_8bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t2, t2, 24
 ; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 4(a1)
-; RV64I-NEXT:    lbu t1, 5(a1)
-; RV64I-NEXT:    or t0, t2, t0
+; RV64I-NEXT:    or a7, t2, t0
+; RV64I-NEXT:    lbu t0, 5(a1)
+; RV64I-NEXT:    lbu t1, 4(a1)
 ; RV64I-NEXT:    lbu t2, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or a7, t1, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
 ; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t2
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a0, a0, a5
-; RV64I-NEXT:    or a4, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a4, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    slli a1, a1, 32
 ; RV64I-NEXT:    or a0, a0, a3
@@ -524,16 +524,16 @@ define void @ashr_8bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    lbu a5, 6(a0)
 ; RV32I-NEXT:    lbu a6, 7(a0)
 ; RV32I-NEXT:    slli a3, a3, 8
-; RV32I-NEXT:    lbu a7, 0(a1)
-; RV32I-NEXT:    lbu t0, 1(a1)
 ; RV32I-NEXT:    or a3, a3, a4
-; RV32I-NEXT:    lbu a4, 2(a1)
+; RV32I-NEXT:    lbu a4, 1(a1)
+; RV32I-NEXT:    lbu a7, 0(a1)
+; RV32I-NEXT:    lbu t0, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t0, t0, 8
-; RV32I-NEXT:    or a7, t0, a7
-; RV32I-NEXT:    slli a4, a4, 16
+; RV32I-NEXT:    slli a4, a4, 8
+; RV32I-NEXT:    or a7, a4, a7
+; RV32I-NEXT:    slli t0, t0, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, a4
+; RV32I-NEXT:    or a1, a1, t0
 ; RV32I-NEXT:    slli a4, a5, 16
 ; RV32I-NEXT:    slli a5, a6, 24
 ; RV32I-NEXT:    or a4, a5, a4
@@ -615,20 +615,20 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 4(a1)
-; RV64I-NEXT:    lbu t2, 5(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 5(a1)
+; RV64I-NEXT:    lbu t2, 4(a1)
 ; RV64I-NEXT:    lbu t3, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a4, t1, a5
-; RV64I-NEXT:    or a6, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a6, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a1, a1, 32
 ; RV64I-NEXT:    or a5, a4, a3
@@ -648,20 +648,20 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli a7, a7, 16
 ; RV64I-NEXT:    slli t0, t0, 24
 ; RV64I-NEXT:    or a6, a6, t1
-; RV64I-NEXT:    lbu t1, 4(a0)
-; RV64I-NEXT:    lbu t2, 5(a0)
 ; RV64I-NEXT:    or a7, t0, a7
-; RV64I-NEXT:    lbu t0, 6(a0)
+; RV64I-NEXT:    lbu t0, 5(a0)
+; RV64I-NEXT:    lbu t1, 4(a0)
+; RV64I-NEXT:    lbu t2, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or t1, t2, t1
-; RV64I-NEXT:    slli t0, t0, 16
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
+; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, t2
 ; RV64I-NEXT:    or a6, a7, a6
 ; RV64I-NEXT:    not a7, a4
 ; RV64I-NEXT:    slli a5, a5, 1
-; RV64I-NEXT:    or a0, a0, t1
+; RV64I-NEXT:    or a0, a0, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a6
 ; RV64I-NEXT:    srl a0, a0, a4
@@ -740,20 +740,20 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a0, a0, 24
 ; RV32I-NEXT:    or t3, t4, t3
 ; RV32I-NEXT:    or a6, t1, a6
-; RV32I-NEXT:    lbu t1, 0(a1)
-; RV32I-NEXT:    lbu t4, 1(a1)
 ; RV32I-NEXT:    or a0, a0, t2
-; RV32I-NEXT:    lbu t2, 2(a1)
+; RV32I-NEXT:    lbu t1, 1(a1)
+; RV32I-NEXT:    lbu t2, 0(a1)
+; RV32I-NEXT:    lbu t4, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t4, t4, 8
-; RV32I-NEXT:    or t1, t4, t1
+; RV32I-NEXT:    slli t1, t1, 8
+; RV32I-NEXT:    or t1, t1, t2
 ; RV32I-NEXT:    sw zero, 16(sp)
 ; RV32I-NEXT:    sw zero, 20(sp)
 ; RV32I-NEXT:    sw zero, 24(sp)
 ; RV32I-NEXT:    sw zero, 28(sp)
-; RV32I-NEXT:    slli t2, t2, 16
+; RV32I-NEXT:    slli t4, t4, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, t2
+; RV32I-NEXT:    or a1, a1, t4
 ; RV32I-NEXT:    mv t2, sp
 ; RV32I-NEXT:    or a3, a4, a3
 ; RV32I-NEXT:    or a4, t0, a7
@@ -767,28 +767,28 @@ define void @lshr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    srli a0, a1, 3
 ; RV32I-NEXT:    andi a3, a1, 31
 ; RV32I-NEXT:    andi a0, a0, 12
-; RV32I-NEXT:    add a0, t2, a0
-; RV32I-NEXT:    lw a4, 0(a0)
-; RV32I-NEXT:    lw a5, 4(a0)
-; RV32I-NEXT:    lw a6, 8(a0)
 ; RV32I-NEXT:    xori a3, a3, 31
+; RV32I-NEXT:    add a0, t2, a0
+; RV32I-NEXT:    lw a4, 4(a0)
+; RV32I-NEXT:    lw a5, 8(a0)
+; RV32I-NEXT:    lw a6, 0(a0)
 ; RV32I-NEXT:    lw a0, 12(a0)
-; RV32I-NEXT:    srl a7, a5, a1
-; RV32I-NEXT:    slli t0, a6, 1
-; RV32I-NEXT:    srl a4, a4, a1
-; RV32I-NEXT:    slli a5, a5, 1
+; RV32I-NEXT:    srl a7, a4, a1
+; RV32I-NEXT:    slli t0, a5, 1
 ; RV32I-NEXT:    srl a6, a6, a1
+; RV32I-NEXT:    slli a4, a4, 1
+; RV32I-NEXT:    srl a5, a5, a1
 ; RV32I-NEXT:    slli t1, a0, 1
 ; RV32I-NEXT:    srl a0, a0, a1
 ; RV32I-NEXT:    sll a1, t0, a3
-; RV32I-NEXT:    sll a5, a5, a3
+; RV32I-NEXT:    sll a4, a4, a3
 ; RV32I-NEXT:    sll a3, t1, a3
 ; RV32I-NEXT:    srli t0, a0, 16
 ; RV32I-NEXT:    srli t1, a0, 24
 ; RV32I-NEXT:    srli t2, a0, 8
 ; RV32I-NEXT:    or a1, a7, a1
-; RV32I-NEXT:    or a4, a4, a5
-; RV32I-NEXT:    or a3, a6, a3
+; RV32I-NEXT:    or a4, a6, a4
+; RV32I-NEXT:    or a3, a5, a3
 ; RV32I-NEXT:    sb a0, 12(a2)
 ; RV32I-NEXT:    sb t2, 13(a2)
 ; RV32I-NEXT:    sb t0, 14(a2)
@@ -851,20 +851,20 @@ define void @shl_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 4(a1)
-; RV64I-NEXT:    lbu t2, 5(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 5(a1)
+; RV64I-NEXT:    lbu t2, 4(a1)
 ; RV64I-NEXT:    lbu t3, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a4, t1, a5
-; RV64I-NEXT:    or a6, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a6, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a1, a1, 32
 ; RV64I-NEXT:    or a5, a4, a3
@@ -884,20 +884,20 @@ define void @shl_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli a7, a7, 16
 ; RV64I-NEXT:    slli t0, t0, 24
 ; RV64I-NEXT:    or a6, a6, t1
-; RV64I-NEXT:    lbu t1, 12(a0)
-; RV64I-NEXT:    lbu t2, 13(a0)
 ; RV64I-NEXT:    or a7, t0, a7
-; RV64I-NEXT:    lbu t0, 14(a0)
+; RV64I-NEXT:    lbu t0, 13(a0)
+; RV64I-NEXT:    lbu t1, 12(a0)
+; RV64I-NEXT:    lbu t2, 14(a0)
 ; RV64I-NEXT:    lbu a0, 15(a0)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or t1, t2, t1
-; RV64I-NEXT:    slli t0, t0, 16
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t1
+; RV64I-NEXT:    slli t2, t2, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, t2
 ; RV64I-NEXT:    or a6, a7, a6
 ; RV64I-NEXT:    not a7, a4
 ; RV64I-NEXT:    srli a5, a5, 1
-; RV64I-NEXT:    or a0, a0, t1
+; RV64I-NEXT:    or a0, a0, t0
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a6
 ; RV64I-NEXT:    sll a0, a0, a4
@@ -976,20 +976,20 @@ define void @shl_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli a0, a0, 24
 ; RV32I-NEXT:    or t3, t4, t3
 ; RV32I-NEXT:    or a6, t1, a6
-; RV32I-NEXT:    lbu t1, 0(a1)
-; RV32I-NEXT:    lbu t4, 1(a1)
 ; RV32I-NEXT:    or a0, a0, t2
-; RV32I-NEXT:    lbu t2, 2(a1)
+; RV32I-NEXT:    lbu t1, 1(a1)
+; RV32I-NEXT:    lbu t2, 0(a1)
+; RV32I-NEXT:    lbu t4, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t4, t4, 8
-; RV32I-NEXT:    or t1, t4, t1
+; RV32I-NEXT:    slli t1, t1, 8
+; RV32I-NEXT:    or t1, t1, t2
 ; RV32I-NEXT:    sw zero, 0(sp)
 ; RV32I-NEXT:    sw zero, 4(sp)
 ; RV32I-NEXT:    sw zero, 8(sp)
 ; RV32I-NEXT:    sw zero, 12(sp)
-; RV32I-NEXT:    slli t2, t2, 16
+; RV32I-NEXT:    slli t4, t4, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, t2
+; RV32I-NEXT:    or a1, a1, t4
 ; RV32I-NEXT:    addi t2, sp, 16
 ; RV32I-NEXT:    or a3, a4, a3
 ; RV32I-NEXT:    or a4, t0, a7
@@ -1087,20 +1087,20 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli t3, t3, 24
 ; RV64I-NEXT:    or t1, t2, t1
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 4(a1)
-; RV64I-NEXT:    lbu t2, 5(a1)
-; RV64I-NEXT:    or t0, t3, t0
+; RV64I-NEXT:    or a7, t3, t0
+; RV64I-NEXT:    lbu t0, 5(a1)
+; RV64I-NEXT:    lbu t2, 4(a1)
 ; RV64I-NEXT:    lbu t3, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli t2, t2, 8
-; RV64I-NEXT:    or a7, t2, a7
+; RV64I-NEXT:    slli t0, t0, 8
+; RV64I-NEXT:    or t0, t0, t2
 ; RV64I-NEXT:    slli t3, t3, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, t3
 ; RV64I-NEXT:    or a3, a4, a3
 ; RV64I-NEXT:    or a5, t1, a5
-; RV64I-NEXT:    or a6, t0, a6
-; RV64I-NEXT:    or a1, a1, a7
+; RV64I-NEXT:    or a6, a7, a6
+; RV64I-NEXT:    or a1, a1, t0
 ; RV64I-NEXT:    slli a4, a5, 32
 ; RV64I-NEXT:    slli a1, a1, 32
 ; RV64I-NEXT:    or a4, a4, a3
@@ -1122,20 +1122,20 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli a6, a6, 16
 ; RV64I-NEXT:    slli a7, a7, 24
 ; RV64I-NEXT:    or a5, a5, t0
-; RV64I-NEXT:    lbu t0, 4(a0)
-; RV64I-NEXT:    lbu t1, 5(a0)
 ; RV64I-NEXT:    or a6, a7, a6
-; RV64I-NEXT:    lbu a7, 6(a0)
+; RV64I-NEXT:    lbu a7, 5(a0)
+; RV64I-NEXT:    lbu t0, 4(a0)
+; RV64I-NEXT:    lbu t1, 6(a0)
 ; RV64I-NEXT:    lbu a0, 7(a0)
-; RV64I-NEXT:    slli t1, t1, 8
-; RV64I-NEXT:    or t0, t1, t0
-; RV64I-NEXT:    slli a7, a7, 16
+; RV64I-NEXT:    slli a7, a7, 8
+; RV64I-NEXT:    or a7, a7, t0
+; RV64I-NEXT:    slli t1, t1, 16
 ; RV64I-NEXT:    slli a0, a0, 24
-; RV64I-NEXT:    or a0, a0, a7
+; RV64I-NEXT:    or a0, a0, t1
 ; RV64I-NEXT:    or a5, a6, a5
 ; RV64I-NEXT:    not a6, a3
 ; RV64I-NEXT:    slli a4, a4, 1
-; RV64I-NEXT:    or a0, a0, t0
+; RV64I-NEXT:    or a0, a0, a7
 ; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    or a0, a0, a5
 ; RV64I-NEXT:    srl a0, a0, a3
@@ -1209,26 +1209,26 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    slli t1, t1, 8
 ; RV32I-NEXT:    or a4, t3, a4
 ; RV32I-NEXT:    or t3, t5, t4
-; RV32I-NEXT:    lbu t4, 0(a1)
-; RV32I-NEXT:    lbu t5, 1(a1)
 ; RV32I-NEXT:    or t0, t1, t0
-; RV32I-NEXT:    lbu t1, 2(a1)
+; RV32I-NEXT:    lbu t1, 1(a1)
+; RV32I-NEXT:    lbu t4, 0(a1)
+; RV32I-NEXT:    lbu t5, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t5, t5, 8
-; RV32I-NEXT:    or t4, t5, t4
-; RV32I-NEXT:    slli t1, t1, 16
+; RV32I-NEXT:    slli t1, t1, 8
+; RV32I-NEXT:    or t1, t1, t4
+; RV32I-NEXT:    slli t5, t5, 16
 ; RV32I-NEXT:    slli a1, a1, 24
-; RV32I-NEXT:    or a1, a1, t1
+; RV32I-NEXT:    or a1, a1, t5
 ; RV32I-NEXT:    or a3, a5, a3
 ; RV32I-NEXT:    mv a5, sp
 ; RV32I-NEXT:    slli t2, t2, 16
 ; RV32I-NEXT:    slli a0, a0, 24
-; RV32I-NEXT:    or t1, a0, t2
+; RV32I-NEXT:    or t2, a0, t2
 ; RV32I-NEXT:    srai a0, a0, 31
 ; RV32I-NEXT:    or a6, a7, a6
 ; RV32I-NEXT:    or a4, t3, a4
-; RV32I-NEXT:    or a7, t1, t0
-; RV32I-NEXT:    or a1, a1, t4
+; RV32I-NEXT:    or a7, t2, t0
+; RV32I-NEXT:    or a1, a1, t1
 ; RV32I-NEXT:    sw a0, 16(sp)
 ; RV32I-NEXT:    sw a0, 20(sp)
 ; RV32I-NEXT:    sw a0, 24(sp)
@@ -1240,28 +1240,28 @@ define void @ashr_16bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    srli a0, a1, 3
 ; RV32I-NEXT:    andi a3, a1, 31
 ; RV32I-NEXT:    andi a0, a0, 12
-; RV32I-NEXT:    add a0, a5, a0
-; RV32I-NEXT:    lw a4, 0(a0)
-; RV32I-NEXT:    lw a5, 4(a0)
-; RV32I-NEXT:    lw a6, 8(a0)
 ; RV32I-NEXT:    xori a3, a3, 31
+; RV32I-NEXT:    add a0, a5, a0
+; RV32I-NEXT:    lw a4, 4(a0)
+; RV32I-NEXT:    lw a5, 8(a0)
+; RV32I-NEXT:    lw a6, 0(a0)
 ; RV32I-NEXT:    lw a0, 12(a0)
-; RV32I-NEXT:    srl a7, a5, a1
-; RV32I-NEXT:    slli t0, a6, 1
-; RV32I-NEXT:    srl a4, a4, a1
-; RV32I-NEXT:    slli a5, a5, 1
+; RV32I-NEXT:    srl a7, a4, a1
+; RV32I-NEXT:    slli t0, a5, 1
 ; RV32I-NEXT:    srl a6, a6, a1
+; RV32I-NEXT:    slli a4, a4, 1
+; RV32I-NEXT:    srl a5, a5, a1
 ; RV32I-NEXT:    slli t1, a0, 1
 ; RV32I-NEXT:    sra a0, a0, a1
 ; RV32I-NEXT:    sll a1, t0, a3
-; RV32I-NEXT:    sll a5, a5, a3
+; RV32I-NEXT:    sll a4, a4, a3
 ; RV32I-NEXT:    sll a3, t1, a3
 ; RV32I-NEXT:    srli t0, a0, 16
 ; RV32I-NEXT:    srli t1, a0, 24
 ; RV32I-NEXT:    srli t2, a0, 8
 ; RV32I-NEXT:    or a1, a7, a1
-; RV32I-NEXT:    or a4, a4, a5
-; RV32I-NEXT:    or a3, a6, a3
+; RV32I-NEXT:    or a4, a6, a4
+; RV32I-NEXT:    or a3, a5, a3
 ; RV32I-NEXT:    sb a0, 12(a2)
 ; RV32I-NEXT:    sb t2, 13(a2)
 ; RV32I-NEXT:    sb t0, 14(a2)
@@ -1392,13 +1392,13 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    slli s7, s7, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, s7
@@ -1415,8 +1415,8 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t1, s0, t6
 ; RV64I-NEXT:    or t2, s5, s1
-; RV64I-NEXT:    or t3, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t3, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a3, a3, 32
 ; RV64I-NEXT:    slli a7, a7, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -1434,23 +1434,23 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    srli a0, a1, 3
 ; RV64I-NEXT:    andi a3, a1, 63
 ; RV64I-NEXT:    andi a0, a0, 24
-; RV64I-NEXT:    add a0, a6, a0
-; RV64I-NEXT:    ld a4, 0(a0)
-; RV64I-NEXT:    ld a5, 8(a0)
-; RV64I-NEXT:    ld a6, 16(a0)
 ; RV64I-NEXT:    xori a3, a3, 63
+; RV64I-NEXT:    add a0, a6, a0
+; RV64I-NEXT:    ld a4, 8(a0)
+; RV64I-NEXT:    ld a5, 16(a0)
+; RV64I-NEXT:    ld a6, 0(a0)
 ; RV64I-NEXT:    ld a0, 24(a0)
-; RV64I-NEXT:    srl a7, a5, a1
-; RV64I-NEXT:    slli t0, a6, 1
-; RV64I-NEXT:    srl a4, a4, a1
-; RV64I-NEXT:    slli a5, a5, 1
+; RV64I-NEXT:    srl a7, a4, a1
+; RV64I-NEXT:    slli t0, a5, 1
 ; RV64I-NEXT:    srl a6, a6, a1
+; RV64I-NEXT:    slli a4, a4, 1
+; RV64I-NEXT:    srl a5, a5, a1
 ; RV64I-NEXT:    slli t1, a0, 1
 ; RV64I-NEXT:    srl t2, a0, a1
 ; RV64I-NEXT:    sll a0, t0, a3
-; RV64I-NEXT:    sll a1, a5, a3
+; RV64I-NEXT:    sll a1, a4, a3
 ; RV64I-NEXT:    sll a3, t1, a3
-; RV64I-NEXT:    srli a5, t2, 56
+; RV64I-NEXT:    srli a4, t2, 56
 ; RV64I-NEXT:    srli t0, t2, 48
 ; RV64I-NEXT:    srli t1, t2, 40
 ; RV64I-NEXT:    srli t3, t2, 32
@@ -1458,12 +1458,12 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    srli t5, t2, 16
 ; RV64I-NEXT:    srli t6, t2, 8
 ; RV64I-NEXT:    or a0, a7, a0
-; RV64I-NEXT:    or a1, a4, a1
-; RV64I-NEXT:    or a3, a6, a3
+; RV64I-NEXT:    or a1, a6, a1
+; RV64I-NEXT:    or a3, a5, a3
 ; RV64I-NEXT:    sb t3, 28(a2)
 ; RV64I-NEXT:    sb t1, 29(a2)
 ; RV64I-NEXT:    sb t0, 30(a2)
-; RV64I-NEXT:    sb a5, 31(a2)
+; RV64I-NEXT:    sb a4, 31(a2)
 ; RV64I-NEXT:    sb t2, 24(a2)
 ; RV64I-NEXT:    sb t6, 25(a2)
 ; RV64I-NEXT:    sb t5, 26(a2)
@@ -1868,13 +1868,13 @@ define void @shl_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    slli s7, s7, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, s7
@@ -1891,8 +1891,8 @@ define void @shl_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t1, s0, t6
 ; RV64I-NEXT:    or t2, s5, s1
-; RV64I-NEXT:    or t3, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t3, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a3, a3, 32
 ; RV64I-NEXT:    slli a7, a7, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -2344,13 +2344,13 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    slli s7, s7, 24
 ; RV64I-NEXT:    or s5, s6, s5
 ; RV64I-NEXT:    or s2, s3, s2
-; RV64I-NEXT:    lbu s3, 4(a1)
-; RV64I-NEXT:    lbu s6, 5(a1)
-; RV64I-NEXT:    or s4, s7, s4
+; RV64I-NEXT:    or s3, s7, s4
+; RV64I-NEXT:    lbu s4, 5(a1)
+; RV64I-NEXT:    lbu s6, 4(a1)
 ; RV64I-NEXT:    lbu s7, 6(a1)
 ; RV64I-NEXT:    lbu a1, 7(a1)
-; RV64I-NEXT:    slli s6, s6, 8
-; RV64I-NEXT:    or s3, s6, s3
+; RV64I-NEXT:    slli s4, s4, 8
+; RV64I-NEXT:    or s4, s4, s6
 ; RV64I-NEXT:    slli s7, s7, 16
 ; RV64I-NEXT:    slli a1, a1, 24
 ; RV64I-NEXT:    or a1, a1, s7
@@ -2363,8 +2363,8 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    or a0, a0, t5
 ; RV64I-NEXT:    or t0, s0, t6
 ; RV64I-NEXT:    or t1, s5, s1
-; RV64I-NEXT:    or t2, s4, s2
-; RV64I-NEXT:    or a1, a1, s3
+; RV64I-NEXT:    or t2, s3, s2
+; RV64I-NEXT:    or a1, a1, s4
 ; RV64I-NEXT:    slli a4, a4, 32
 ; RV64I-NEXT:    slli a6, a6, 32
 ; RV64I-NEXT:    slli a0, a0, 32
@@ -2387,23 +2387,23 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    srli a0, a1, 3
 ; RV64I-NEXT:    andi a3, a1, 63
 ; RV64I-NEXT:    andi a0, a0, 24
-; RV64I-NEXT:    add a0, s6, a0
-; RV64I-NEXT:    ld a4, 0(a0)
-; RV64I-NEXT:    ld a5, 8(a0)
-; RV64I-NEXT:    ld a6, 16(a0)
 ; RV64I-NEXT:    xori a3, a3, 63
+; RV64I-NEXT:    add a0, s6, a0
+; RV64I-NEXT:    ld a4, 8(a0)
+; RV64I-NEXT:    ld a5, 16(a0)
+; RV64I-NEXT:    ld a6, 0(a0)
 ; RV64I-NEXT:    ld a0, 24(a0)
-; RV64I-NEXT:    srl a7, a5, a1
-; RV64I-NEXT:    slli t0, a6, 1
-; RV64I-NEXT:    srl a4, a4, a1
-; RV64I-NEXT:    slli a5, a5, 1
+; RV64I-NEXT:    srl a7, a4, a1
+; RV64I-NEXT:    slli t0, a5, 1
 ; RV64I-NEXT:    srl a6, a6, a1
+; RV64I-NEXT:    slli a4, a4, 1
+; RV64I-NEXT:    srl a5, a5, a1
 ; RV64I-NEXT:    slli t1, a0, 1
 ; RV64I-NEXT:    sra t2, a0, a1
 ; RV64I-NEXT:    sll a0, t0, a3
-; RV64I-NEXT:    sll a1, a5, a3
+; RV64I-NEXT:    sll a1, a4, a3
 ; RV64I-NEXT:    sll a3, t1, a3
-; RV64I-NEXT:    srli a5, t2, 56
+; RV64I-NEXT:    srli a4, t2, 56
 ; RV64I-NEXT:    srli t0, t2, 48
 ; RV64I-NEXT:    srli t1, t2, 40
 ; RV64I-NEXT:    srli t3, t2, 32
@@ -2411,12 +2411,12 @@ define void @ashr_32bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; RV64I-NEXT:    srli t5, t2, 16
 ; RV64I-NEXT:    srli t6, t2, 8
 ; RV64I-NEXT:    or a0, a7, a0
-; RV64I-NEXT:    or a1, a4, a1
-; RV64I-NEXT:    or a3, a6, a3
+; RV64I-NEXT:    or a1, a6, a1
+; RV64I-NEXT:    or a3, a5, a3
 ; RV64I-NEXT:    sb t3, 28(a2)
 ; RV64I-NEXT:    sb t1, 29(a2)
 ; RV64I-NEXT:    sb t0, 30(a2)
-; RV64I-NEXT:    sb a5, 31(a2)
+; RV64I-NEXT:    sb a4, 31(a2)
 ; RV64I-NEXT:    sb t2, 24(a2)
 ; RV64I-NEXT:    sb t6, 25(a2)
 ; RV64I-NEXT:    sb t5, 26(a2)
diff --git a/llvm/test/CodeGen/RISCV/xtheadmempair.ll b/llvm/test/CodeGen/RISCV/xtheadmempair.ll
index 3525c40026064..4df61dad7d039 100644
--- a/llvm/test/CodeGen/RISCV/xtheadmempair.ll
+++ b/llvm/test/CodeGen/RISCV/xtheadmempair.ll
@@ -57,14 +57,14 @@ define i64 @lwud(ptr %a) {
 define i64 @ldd(ptr %a) {
 ; RV32XTHEADMEMPAIR-LABEL: ldd:
 ; RV32XTHEADMEMPAIR:       # %bb.0:
-; RV32XTHEADMEMPAIR-NEXT:    lw a1, 44(a0)
-; RV32XTHEADMEMPAIR-NEXT:    lw a2, 32(a0)
-; RV32XTHEADMEMPAIR-NEXT:    lw a3, 36(a0)
+; RV32XTHEADMEMPAIR-NEXT:    lw a1, 32(a0)
+; RV32XTHEADMEMPAIR-NEXT:    lw a2, 36(a0)
+; RV32XTHEADMEMPAIR-NEXT:    lw a3, 44(a0)
 ; RV32XTHEADMEMPAIR-NEXT:    lw a0, 40(a0)
-; RV32XTHEADMEMPAIR-NEXT:    add a1, a3, a1
-; RV32XTHEADMEMPAIR-NEXT:    add a0, a2, a0
-; RV32XTHEADMEMPAIR-NEXT:    sltu a2, a0, a2
-; RV32XTHEADMEMPAIR-NEXT:    add a1, a1, a2
+; RV32XTHEADMEMPAIR-NEXT:    add a2, a2, a3
+; RV32XTHEADMEMPAIR-NEXT:    add a0, a1, a0
+; RV32XTHEADMEMPAIR-NEXT:    sltu a1, a0, a1
+; RV32XTHEADMEMPAIR-NEXT:    add a1, a2, a1
 ; RV32XTHEADMEMPAIR-NEXT:    ret
 ;
 ; RV64XTHEADMEMPAIR-LABEL: ldd:


